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Abstract 

BerkeleyGW is a massively parallel computational package for electron excited- 
state properties that is based on many-body perturbation theory employing 
the ab initio GW and GW plus Bethe-Salpeter equation methodology. It can 
be used in conjunction with many density-functional theory codes for ground- 
state properties, including PARATEC, PARSEC, Quantum ESPRESSO, SIESTA, 
and Octopus. The package can be used to compute the electronic and opti- 
cal properties of a wide variety of material systems from bulk semiconductors 
and metals to nanostructured materials and molecules. The package scales 
to 10000s of CPUs and can be used to study systems containing up to 100s 
of atoms. 

Keywords: Many-Body Physics, GW, Bethe-Salpeter Equation, 
Quasiparticle, Optics, Exciton 
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Figure 1: The logo for the BerkeleyGW code. 

No. of bj^es in distributed program, including test data, output, etc.: 
200MB 

Distribution format: tar 

Programming language: Fortran 90, C, C++, Python, Perl, BASH 
Libraries required: BIAS, LAPACK, FFTW, ScaLAPACK (optional), 

MPI (optional). All available under open-source licenses. 

Memory required: (50-2000) MB per CPU (Highly dependent on system 

size) 

Computers for which the program has been designed and others on which 
it has been operable: Linux/UNIX workstations or clusters 

Operating systems under which the program has been tested: Tested on 
a variety of Linux distributions in parallel and serial as well as AIX and Mac 
OSX. 

Nature of problem: The excited state properties of materials involve the 
addition or subtraction of electrons as well the optical excitations of electron- 
hole pairs. The excited particles interact strongly with other electrons in a 
material system. This interaction affects the electronic energies, wavefunc- 
tions and lifetimes. It is well known that ground-state theories, such as stan- 
dard methods based on density-functional theory, fail to correctly capture 
this physics. 

Solution method: We construct and solve Dyson's equation for the quasi- 
particle energies and wavefunctions within the GW approximation for the 
electron self energy. We additionally construct and solve the Bethe-Salpeter 
equation for the correlated electron-hole (exciton) wavefunctions and excita- 
tion energies. 

Restrictions: The material size is hmited in practice by the computational 
resources available. Materials with up to 500 atoms per periodic cell can be 
studied on large HPCs. 
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Running time: 1-1000 minutes (depending greatly on system size and 
processor number) 



2. Introduction 

Over the last few decades, the ab initio GW methodology has been suc- 
cessfully applied to the study of the quasiparticle properties of a large range of 
material systems from traditional bulk semiconductors, insulators and met- 
als to, more recently, nano-systems like polymers, nano-wires and molecules 
[H, li], y, 0, lEj . The GW approach, which is based on approximating the elec- 
tron self energy as the first term in an expansion in the screened Coulomb 
interaction, W 0, has proven to yield quantitatively accurate quasiparticle 
band gaps and dispersion relations from first principles. 

Additionally, the Bethe-Salpeter equation (BSE) approach to the optical 
properties of materials has proven exceptionally accurate in predicting the 
optical response of a similarly large class of materials employing an electron- 
hole interaction kernel derived within the same level of approximations as 
GW 0,ii[l3- 

The combined GW-BSE approach is now arguably regarded as the most 
accurate methodology commonly used for computing the quasiparticle and 
optical properties of condensed-matter systems. A perceived drawback of the 
GW methodology is its computational cost; a GW-BSE calculation is usually 
thought to be an order of magnitude (or worse) more costly than a typical 
density functional theory (DFT) calculation for the same system. Since the 
pioneering work of Ref . [ll , many GW implementations have been made, but 
most are limited to small systems of the size of 10s of atoms, and scaling to 
only small numbers of CPUs on the order of 100. 

BerkeleyGW is a massively parallel computer package written predom- 
inantly in F0RTRAN90 that implements the ab initio GW methodology of 
Hybertsen and Louie [l| and includes many more recent advances, such as 
the Bethe-Salpeter equation approach for optical properties j^. It alleviates 
the restriction to small numbers of atoms and scales beyond thousands of 
CPUs. The package is intended to be used on top of a number of mean- 
field (DFT and other) codes that focus on ground-state properties, such as 
PARATEC Quantum ESPRESSO [l2|, SIESTA [lij, PARSEC Qllij, Octopus 



IgI firt and an empirical pseudopotential code (EPM) included in the pack- 
age (based on TBPW 18||). More information about BerkeleyGW, the lat- 
est source code, and help forums can be found by visiting the website at 
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http : / /berkeleygw . org/. 



3. Theoretical Framework 



The ab initio GW-BSE approach is a many-body Green's-function method- 
ology in which the only input parameters are the constituent atoms and 
the approximate structure of the system [l], [8|. Typical calculations of the 
ground- and excited-state properties using the GW-BSE method can be bro- 
ken into three steps: (1) the solution of the ground-state structural and 
electronic properties within a suitable ground-state theory such as ab initio 
pseudopotential density-functional theory, (2) the calculation of the quasi- 
particle energies and wavefunctions within the GW approximation for the 
electron self-energy operator, and (3) the calculation of the two-particle cor- 
related electron-hole excited states through the solution of a Bethe-Salpeter 
equation. 

DFT calculations, often the chosen starting point for GW, are performed 
by solving the self-consistent Kohn-Sham equations with an approximate 
functional for the exchange-correlation potential, ^ common approxima- 
tions being the local density approximation (LDA) [l9| and the generalized- 
gradient approximation (GGA) (2o| : 



1. 



DFT 
xc 



„/,DFT 



pDFT„/,DFT 



(1) 



where E^^^ and i'n^^ Kohn-Sham eigenvalues and eigenfunctions 

respectively, Vion is the ionic potential, Vh is the Hartree potential and \4c is 
the exchange-correlation potential within a suitable approximation. When 
DFT is chosen as the starting point for GW, the Kohn-Sham wavefunctions 
and eigenvalues are used here as a first guess for their quasiparticle coun- 
terparts. The quasiparticle energies and wavefunctions {i.e., the one-particle 
excitations) are computed by solving the following Dyson equation 21. [l| in 
atomic units: 



Vh + 2(B2r) 



(2) 



where S is the self-energy operator within the GW approximation, and E^^ 
and ipnk quasiparticle energies and wavefunctions, respectively. For 

systems of periodic dimension less than three, the Coulomb interaction may 
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be replaced by a truncated interaction. The interaction is set to zero for 
particle separation beyond the size of the system in order to avoid unphys- 
ical interaction between the material and its periodic images in a super-cell 



22[ calculation. The electron-hole excitation states (probed in optical or 
other measurements) are calculated through the solution of a Bethe-Salpeter 
equation js], 0| for each exciton state S: 



{El - ES!)Ai^ + Y: {vck\K^>'cV) = n'Ai^ (3) 

t'c'k' 



where A^^k exciton wavefunction (in the quasiparticle state represen- 

tation), [2^ is the excitation energy, and K'^^ is the electron-hole interaction 
kernel. We make the Tamm-Dancoff approximation by including only valence 
— 7- conduction transitions js], [isj. The exciton wavefunction can be expressed 
in real space as: 

^(re, rh) = J2 ^ik^k,c(re)^k,.(rh), (4) 

k,c,D 

and the imaginary part of the dielectric function, if one is interested in optical 
properties, can be expressed as 

e,(c.) = i^5^|e.(0|v|5)|^5(a;-r?^) (5) 

^ s 

where e- (OlvIS*) is the velocity matrix element along the direction of the po- 
larization of light, e. One may compare this to the non-interacting absorption 
spectrum: 

e.iu^) = ^ E h • W-\ck)\X- - El + E^:). (6) 

tick 

An example absorption spectrum for silicon computed with the BerkeleyGW 
package at the GW and GW-BSE levels is shown in Fig. [21 Only when both 
the quasiparticle effects within the GW approximation and the excitonic ef- 
fects through the solution of the Bethe-Salpeter equation are included is good 
agreement with experiment reached. 
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Figure 2: The absorption spectra for silicon calculated at the GW (black dashed) and 
GW-BSE (red solid) levels using the BerkeleyGW package. Experimental data from p3 |. 
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Figure 3: Flow chart of a GW-BSE calculation performed in the BerkeleyGW package. 
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4. Computational Layout 



4.1. Major Sections of the Code 

Figure [3] illustrates the procedure for carrying out an ab initio GW- 
BSE calculation to obtain quasiparticle and optical properties using the 
BerkeleyGW code. First, one obtains the mean-field electronic orbitals and 
eigenvalues as well as the charge density. One can utilize one of the many 
supported DFT codes ll|, [l^ ll3|, |l5|, ll7| to construct this mean-field starting 
point and convert it to the plane-wave BerkeleyGW format (see Appendix) 
using the wrappers included. (Note that norm- conserving pseudopotentials 
must be used, or else extra contributions would need to be added to our 
matrix elements.) 

The Epsilon executable produces the polarizability and inverse dielectric 
matrices. In the epsilon executable, the static or frequency-dependent po- 
larizability and dielectric function are calculated within the random-phase 
approximation (RPA) using the electronic eigenvalues and eigenfunctions 
from a mean-field reference system. The main outputs are files epsOmat 
and epsmat that contain the inverse-dielectric matrix for q — )■ and q 7^ 0. 

In the Sigma executable, the screened Coulomb interaction, W, is con- 
structed from the inverse dielectric matrix and the one-particle Green's func- 
tion, G, is constructed from the mean-field eigenvalues and eigenfunctions. 
We then calculate the diagonal and (optionally) off-diagonal elements of the 
self-energy operator, E = iGW, as a matrix in the mean-field basis. In 
many cases, only the diagonal elements are sizable within the chosen mean- 
field orbital basis; in such cases, in applications to real materials, the effects 
of S can be treated within first-order perturbation theory. The sigma ex- 
ecutable evaluates S in the form S = V^c + (S — V^c), where V^c is the 
independent-particle mean-field approximation to the exchange- correlation 
potential of the chosen mean-field system. For moderately correlated elec- 
tron systems, the best available mean-field Hamiltonian may often be taken 



to be the Kohn-Sham Hamiltonian [19|. However, many mean-field starting 



points are consistent with the BerkeleyGW package, such as Hartree-Fock, 
static COHSEX and hybrid functionals. In principle, the process of correct- 
ing the eigenfunctions and eigenvalues (which determine W and G) could 
be repeated until self-consistency is reached or the S matrix diagonalized in 
full. However, in practice, it is found that an adequate solution often is ob- 
tained within first-order perturbation theory on Dyson's equation for a given 



E |25l . |26[. Comparison of calculated energies with experiment shows that 
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this level of approximation is very accurate for semiconductors and insulators 
and for most conventional metals. The outputs of the sigma executable are 
E^^ , the quasiparticle energies, which are written to the file eqp.dat using 
the eqp.py post-processing utility on the generated sigma.log files for each 
sigma run. 

The BSE executable, kernel, takes as input the full dielectric matrix 
calculated in the epsilon executable, which is used to screen the attractive 
direct electron-hole interaction, and the quasiparticle wavefunctions, which 
often are taken to be the same as the mean-field wavefunctions. The direct 
and exchange part of the electron-hole kernel are calculated and output into 
the bsedmat and bsexmat files respectively. The absorption executable uses 
these matrices, the quasiparticle energies and wavefunctions from a coarse It- 
point grid GW calculation, as well as the wavefunctions from a fine k-point 
grid. The quasiparticle energy corrections and the kernel matrix elements 
are interpolated onto the fine grid. The Bethe-Salpeter Hamiltonian, con- 
sisting of the electron-hole kernel with the addition of the kinetic-energy 
term, is constructed in the quasiparticle electron-hole pair basis and diago- 
nalized yielding the electron-hole amplitude, or exciton wavefunctions, and 
excitation energies, printed in the file eigenvectors. Exciton binding en- 
ergies can be inferred from the energy of the correlated exciton states rel- 
ative to the inter-band-transition continuum edge. With the excitation en- 
ergies and amplitudes of the electron-hole pairs, one then can calculate the 
macroscopic dielectric function for various light polarizations which is writ- 
ten to the file absorption_eh.dat. This may be compared to the absorption 
spectrum without the electron-hole interaction included, printed in the file 
absorption_noeh . dat. 

Example input files for each executable are contained within the source 
code for the package, as well as complete example calculations for silicon, the 
(8,0) and (5,5) single- walled carbon nanotubes (SWCNTs), the CO molecule, 
and sodium metal. There are several post-processing and visualization utili- 
ties included in the package that are described in Sec. M 

Additionally, sums over k and q are accompanied by an implicit division 
by the volume of the super-cell considered, Vsc = iVfcKic, where Nk is the 
number of points in the k-grid and V^c is the volume of the unit cell in a 
periodic system. 

Throughout the paper, we refer to benchmark numbers from calculations 
on the (20,20) SWCNT. This system has 80 carbon atoms and 160 occupied 
bands. We use 800 unoccupied bands in all sums requiring empty orbitals. 
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51 au 




Figure 4: The cross section of the (20,20) SWCNT used throughout the paper as a bench- 
mark system. 



Step 


# CPUs 


CPU hours 


Wall hours 


DFT Coarse 


64 X 32 


19000 


9.1 


DFT Fine 


64 X 256 


29000 


1.8 


epsilon 


1600 X 32 


61000 


1.2 


Sigma 


960 X 16 


46000 


3.0 


kernel 


1024 


600 


0.6 


absorption 


256 


500 


2.0 



Table 1: Breakdown of the CPU and wah-clock time spent on the calculation of the (20,20) 
SWCNT with parameters described in the text. The x indicates an additional level of 
trivial parallelization over the k- or q-points. 

We use a super-cell of size 80 x 80 x 4.6 au^ equivalent to a bulk system of 
greater than 500 atoms. We use a 1 x 1 x 32 coarse k-grid and a 1 x 1 x 256 
fine k-grid. We calculate the self-energy corrections within the diagonal 
approximation for 8 conduction and 8 valence bands. The Bethe-Salpeter 
equation is solved with 8 conduction and 8 valence bands. The relative 
costs of the various steps in the GW-BSE calculation using the BerkeleyGW 
package is shown in Table [T] As can be seen from the table, the actual time 
to solution for the GW-BSE part of the calculation is smaller than that of 
the DFT parts. 
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4-2. RPA Dielectric Matrix: epsilon 

epsilon is a standalone executable that computes either the static or dy- 
namic RPA polarizability and corresponding inverse dielectric function from 
input electronic eigenvalues and eigenvectors computed in a suitable mean- 
field code. As we discuss in detail below, the input electronic eigenvalues and 
eigenvectors can come from a variety of different mean-field approximations 
including DFT within LDA/GGA, generalized Kohn-Sham hybrid- functional 
approximations as well as direct approximations to the GW Dyson's equa- 



27j approximation and the Hartree-Fock 



tion such as the static-COHSEX |21l . 
approximation. 

We will first discuss the computation of the static polarizability and the 
inverse dielectric matrix. The epsilon executable computes the static RPA 
polarizability using the following expression [l|: 

occ omp 

XGG'(q;0)= 5^5^5^Ml,(k,q,G)M„„,(k,q,G')^^ (7) 

n n' k ^ 

where 

M„„,(k, q, G) = (nk+q| e*('i+^)-^ |n'k) (8) 

are the plane-wave matrix elements. Here q is a vector in the first Brillouin 
zone, G is a reciprocal-lattice vector, and (nk| and Enk are the mean- field 
electronic eigenvectors and eigenvalues. The matrix in Eq. [7] is to be evalu- 
ated up to |q-|- Gp, |q+ G'p < Ecut where Ecnt defines the dielectric energy 
cutoff. The number of empty states, n', included in the summation must be 
such that the highest empty state included has an energy corresponding to 
Ecnt- There are therefore not two convergence parameters, but only one, in 
evaluating Eq. [T) one either must choose to converge with empty states or 
with the dielectric energy cutoff and set the remaining parameter to match 
the chosen convergence parameter. The epsilon code itself reports the con- 
vergence of Eq. [7]in an output file called chi_converge.dat (plotted in Fig. 
[5]), that presents the computed value of Xgg'=o(cI)0) and XGG'=Gmax(qj 0) 
using partial sums in Eq. [7] where Gmax is the largest reciprocal-lattice vec- 
tor included, and the number of empty states is varied between 1 and the 
maximum number requested in the input file, epsilon. inp. A simple ex- 
trapolation also is included. 

With the expression for x above, we can obtain the RPA dielectric matrix 

as 

eGG'(q; 0) = "^GG' - ^(q+G)xGG'(q; o) (9) 
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Figure 5: Example convergence output plotted from chi_converge.dat showing the con- 
vergence of the sum in Eq. [7] for the G,G' = and q = (0,0,0.5) component of x in 
ZnO. 



where f (q+G) is the bare Coulomb interaction defined as: 



v{q+G) 



|q+G| 



(10) 



in the case of bulk crystals where no truncation is necessary. We discuss in 
Sec. M how to generalize this expression for the case of nano-systems where 
truncating the interaction in non-periodic directions greatly improves the 
convergence with super-cell size. 

It should be noted that we use an asymmetric definition of the Coulomb 
interaction, as opposed to symmetric expressions such as 



t;(q+G,q + G') 



|q+G| |q+G'| 



(11) 



This causes €QG'{q;0) and Xgg'(^;0) to be also asymmetric in G and G'. 
This asymmetry is resolved when constructing the static screened Coulomb 
interaction by use of the expression: 



W, 



GG' 



(12) 



Here W is symmetric in G and G' even though both v and e ^ individually 
are not. 
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The computation of e^^,(q;0) in the epsilon code involves three com- 
putationally intensive steps: the computation of the matrix elements needed 
for the summation in Eq. [71 the summation itself and the inversion of the 
dielectric matrix to yield e^^,(q;0). The epsilon code first computes all 
the matrix elements M„„/ required in the summation for Eq. [71 This step 
is generally the most time-consuming step in the execution of the epsilon 
code. Naively, one might think this process scales as N^, where is the 
number of atoms in the system. This is because both the number of valence 
and conduction bands needed scales linearly with N and the number of G 
vectors scales linearly with the cell volume which itself scales linearly with 
the number of atoms. Thus, we must calculate matrix elements each of 
which involves a sum over the plane-wave basis set for the eigenf unctions. 
We therefore have an scaling. However, we can achieve log N scaling 
by using fast Fourier transforms (FFTs), noting that the expression in Eq. 
m is a convolution in Fourier space jisj . Therefore, Eq. [HI can be written as 
the Fourier transform of a direct product of the wavefunctions in real space: 

M„„,(k, q, {G}) = FFT-i (C,k+q(r)0n',k(r)) . (13) 

The FFTs are implemented with FFTW [29| and scale as A^logA^. The 
computation of all the matrix elements needed for Eq. [71 therefore scales as 
A^'^ log A^. We discuss in the following sections that the computation of these 
matrix elements can be parallelized very trivially up to tens of thousands of 
CPUs. Given an infinite resource of CPUs, our implementation would have 
a wall-time scaling of N log N, nearly linear in the number of atoms. 

Having computed the individual matrix elements required in Eq. [71 we 
now turn our attention to the summation involved in the same expression. 
It should be noted that the formal scaling of this step with the number of 
atoms is N"^ since one must sum over the number of occupied bands and the 
number of unoccupied bands for every G and G' pair - each one of these 
quantities scales linearly with the number of atoms. This step therefore for- 
mally has the worst scaling of the entire GW process - leading many to claim 
that GW as a whole scales like A^^. However, in practice for most systems 
currently under study within a generalized plasmon-pole (GPP) jl| or other 
approximation where this sum is done only once for the static polarizability, 
this step represents less than 10 percent of a typical calculation time even 
for systems of 100s of atoms because this step can be optimized and paral- 
lelized greatly. In particular, Eq. [71 can be written very compactly as a single 
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matrix- matrix product for each q: 

XGG'(q; 0) = M*(G, q, (n, n', k)) ■ M^{G', q, (n, n', k)) (14) 

where (n, n',k) represents a single composite index that is summed over as 
the inner dimension in the matrix- matrix product. The matrices M can be 
expressed in terms of the matrix elements M as: 

M(G, q, (n, n', k)) = M„„,(k, q, G) ■ (15) 

The single dense matrix-matrix product required in Eq. [H] still scales 
as N'^ since the inner dimension, (n, n',k), scales as A^^ and dense matrix 
multiplication itself scales as A^^. However, in the BerkeleyGW package, 
this single step is still made quite rapid for even systems as large as 100s of 



atoms. The LEVEL 3 BLAS [30| hbraries DGEMM and ZGEMM and their parallel 
analogues are used to compute the single matrix product in Eq. [HI As we 
discuss further in Sec. I5.lt in the evaluation of Eq. El the parallel wall-time 
scaling is A^^ with the number of atoms. 

Finally, once we have constructed Xgg'{^.', 0) we can construct the RPA 
dielectric matrix and inverse dielectric matrix required for the computation of 
the screened Coulomb interaction, W. The dielectric matrix as implemented 
in the code is expressed in Eq. HI 

Here we require for the first time the Coulomb interaction in reciprocal 
space z;(q+G), which can be computed trivially from Eq. [10] for the case of 
bulk crystals, but requires an FFT for the case of nanostructured materials. 
We discuss this more in Sec. |6l 

There is a clear problem in directly computing eoo(q = 0) due to the fact 
that the Coulomb interaction, Eq. diverges as q — )■ except in the case 
of box- type truncation schemes (see Sec. [H]). For semiconducting systems, 
due to orthogonality, the matrix elements (Eq. [8]) themselves go to with 
the form |M„„/(k, q, G = 0)| oc Thus e(q — )■ 0) contains a non-trivial 
g^/g^ limit. One way to handle this would be to take the limit of Eqs. [7] and 
E] analytically via k • p perturbation theory, where the perturbation is the 
momentum operator —iV plus the commutators with the non-local potential 
of the mean-field Hamiltonian 31, [H. This is analogous to the treatment of 



the velocity operator in absorption (Eq. H6|) . 

The epsilon code has implemented a simpler scheme, however, in which 
we numerically take the limit as q — )■ by evaluating eoo(qo) at a small but fi- 
nite qo usually taken as approximately 1/lOOOth of the Brillouin zone, in one 
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of the periodic directions. For semiconducting systems, where eoo(q = 0) — )■ 
C, it is sufficient to construct a separate k-grid for the conduction and valence 
bands shifted by the small vector qo in order to compute M„„/(k, qo, G = 0), 
where n is a valence and n' a conduction band, and to evaluate the cor- 
rect limiting q^/q^ ratio. For metals, however, intra-band transitions have 
|M„„/(k, q, G = 0)1 oc C, yielding eoo(q— j-O) oc C'/q^. In this case, the 
two- k-grid treatment is insufficient, because the proportionality coefficient 
C depends sensitively on the density of states (DOS) at the Fermi energy. 
Therefore a k-grid sampling of the same spacing as qo is required, although 
fewer conduction bands are necessary in the sum since e(q — )■ 0) is domi- 
nated by intra-band transitions. Thus we typically calculate e(q — > 0) using 
a single fine wavefunction grid by using the smallest q consistent with the 
grid. Note that this treatment of intra-band transitions is still the zero- 
temperature limit in our code, as the effect of thermal occupations is small 



in GW except at very large temperatures [32[. Effectively occupations are 
taken as one below the Fermi level, zero above the Fermi level, and 1/2 at the 
Fermi level (as needed for graphene at the Dirac point). This is despite any 
smearing that may have been used in the underlying mean-field calculation. 
We should point out that only one qo is used; if the material is anisotropic 
(in periodic directions), in principle an average over the three directions of 
qo should be done. This may be accomplished by using a vector in the (111) 
direction (referred to the principal axes of e). Neglect of the anisotropy can 
give significant errors in sigma [33]. 

The inversion of the dielectric matrix required to compute W , Eq. [121 
is done with LAPACK and ScaLAPACK (for parallel calculations) using 
ZGESV, DGESV and their parallel counterparts. The inversion scales like A^^ 
with the number of atoms and, as we discuss below, scales well up to 100s of 
processors with ScaLAPACK. In general, for systems of up to 100s of atoms, 
the inversion step represents less than 10 percent of the total computation 
time for epsilon. 

We have so far limited ourselves to situations in which only a direct 
calculation of the static polarizability, Eq. [TJ is required, such as in the static- 
COHSEX approximation 27 1 or when utilizing a GPP model [H to extend 
the dielectric response to non-zero frequencies. However, we can also do a 
more refined calculation. Options are given in the code so that the dielectric 
matrix is computed directly at real frequencies without extrapolation, as is 
formally required in the Dyson equation. We use in the package the advanced 
and retarded dielectric functions, defined as: 
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r/a 
GG' 



^GG' - v{q+G) 



(16) 



occ cmp 



X 



E E E ^nn'iK q, G)M„„,(k, q, G') 



n n' k 



1 r 1 



1 



X 



2 [ -Enk+q — -En'k — -E =F -^nk+q ~ 



E„/k + ^ ± i5 



where E is the evaluation frequency and 5 is a broadening parameter chosen 
to be consistent with the energy spacing afforded by the k-point samphng 
of the calculation, using the upper (lower) signs for the retarded (advanced) 
function. In principle, one must converge the calculation with respect to 
increasing the k-point sampling and decreasing this broadening parameter. 

In the epsilon code, we compute Eq. [16] on a grid of real frequencies, 
E, specified by a frequency spacing, a low-frequency cutoff, a high-frequency 
cutoff and a frequency-spacing increment. We sample the frequency on the 
real axis uniformly from to the low-frequency cutoff with a sampling rate 
given by the frequency spacing. We then increase the frequency spacing by 
the step increment until we reach the high-frequency cutoff (Fig. [7]). In gen- 
eral, one also must refine this frequency grid until convergence is reached, 
but we find that for the purpose of calculating band gaps of typical semi- 
conductors, a frequency spacing of a few hundred meV and a high-frequency 
cutoff of twice the dielectric energy cutoff is sufficient, though it should be 
noted this energy can be quite high {e.g. the case of ZnO j34|). 

In cases where the computation of the "full-frequency" dielectric response 
function is required, the bottleneck of the calculation often does become the 
iV^ summation step of Eq. [TH This is because the computation of the matrix 
elements, Eq. [HI needs only be done once, whereas the summation must be 
done for all frequencies separately. Because of this, a full-frequency epsilon 
calculation of between 10-50 frequencies costs only twice the time of a static 
epsilon calculation, but the cost scales linearly with frequencies after this 
point. 

4-3. Computation of the Self- Energy: sigma 

The sigma executable takes as input the inverse epsilon matrix calcu- 
lated from the epsilon executable and a suitable set of mean-field electronic 
energies and wavef unctions. It computes a representation of the Dyson's 
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Figure 6: Example output plotted from EpsDyn file showing the computed eoo(w) in ZnO. 
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Figure 7: Schematic of the frequency- grid parameters for a full-frequency calculation in 
epsilon. The open circles are a continuation of the uniform grid that are omitted above 
the low-frequency cutoff. 
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equation, Eq. ^ in the basis of the mean-field eigenf unctions through the 
computation of the diagonal and off-diagonal elements of S: 



(^„k|^^'^^(^)|V^mk)= (17) 

where E is an energy parameter that should be set self-consistently to the 
quasiparticle eigenvalues, E^ and ipnk are the mean-field eigenvalues and 
eigenvectors and is a mean-field approximation to the electronic self- 
energy operator, such as V^c in the case of a DFT starting point. 

It is often the case that the mean-field wavefunctions are sufficiently close 
to the quasiparticle wavefunctions [l[ that one may reduce Eq. [T7]to include 
only diagonal matrix elements. In this case the user may ask for only diagonal 
elements, and the quasiparticle energies will be updated in the following way: 

El = + (V^nklS (E) - S^^ (E) (18) 

The mean field in Eq. [18] and Eq. [17] can be DFT within the LDA or 
GGA schemes as well as within a hybrid-functional approach. In the LDA 
case, for example, (E) = V^c, is local and energy-independent. The 
starting mean-field calculation can also be an approximation to the Dyson's 
equation, Eq. ^ such as Hartree-Fock (the zero- screening limit) or static 



COHSEX (the static-screening limit) |35l . [36], |27(]. The use of these mean 



field starting points for construction of Eq. [T7] and Eq. [18] is classified as a 
one-shot GqWq calculation (the subscript means that both G and W are 
constructed from the mean-field eigenvalues and eigenvectors). One also can 
start from a previous iteration of GW in an eigenvalue or eigenvector self- 



consistency scheme j36|, |37|. In this case, the 'MF' superscripts in Eq. [T8]and 
[17] should be renamed "previous" to designate the self-consistency process. 

The Sigma executable itself can evaluate the matrix elements of S in 
Eq. [TH] and Eq. [T7] within various approximations: Hartree-Fock, static 
COHSEX, GW within a GPP model and full-frequency GW. 

For GW and static-COHSEX calculations, S can be broken into two parts, 
S = Ssx + ScH, where Ssx_is the screened exchange operator and Sch is 
the Coulomb-hole operator jll, [g], [2i[. These are implemented in the sigma 
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executable in the following way for a full-frequency calculation: 



in 



k| Ssx(i5) |n'k) = -J2Yl ^n"ni^^ -q> -G)M„.„,(k, -q, -G') (19) 

n" qGG' 

X [eGC'V' (q; ^ - K"k-q)t^(q+G') 



and 



nk| EcH(i?) Iri'k) = ^J2Y1 -q, -G)M„»„,(k, -q, -G') (20) 



n" qGG' 



X 



fe..]- (q;g)- fee-]- (q;g-) , 

where M is defined in Eq. |8] and e'' and e*^ are the retarded and advanced 
dielectric matrices defined in Eq. [16] [38^]. In practice the sigma executable 
computes the matrix elements of bare exchange, Ex and of Ssx — ^x, where 
the matrix elements of Sx are obtained by replacing [cgg']"^ (qj ~ -E'ri/'k-q) 
with 5g,g' in Eq- [IH] (as given by Eq. [30] below). The integral in Eq. [20] 
over frequency is done numerically on the frequency grid used in the epsilon 
executable (Fig. [7]). 

For GPP calculations, the corresponding expressions used in the code are: 

occ 

{nk\ Esx{E) |n'k) = -J2J2 ^n"ni^^ -q, -G)M„„„,(k, -q, -G') (21) 

n" qGG' 



I^GG' + 



^GG'(q) (1 -^tan(/)GG'(q)) 



i;(q+G') 



and 



k| EcniE) \n'k) = IJ2Y1 ^^n"ni^^ -q, -G)M„»„,(k, -q, -G') (22) 

t;(q+G') 



n" qGG' 

X 



^GG'(q) (1 -^tan0GG'(q)) 



^^GG'(q) (^-^n"k-q-^GG'(q)) 

where fiGG'(q)) '^GG'(q)) -^GG'(q) 0GG'(q) ^-re the effective bare plasma 
frequency, the GPP mode frequency, the amplitude and the phase of the 
renormalized f2^j_,,(q) Isoj defined as: 



GG'iqj = ^p 



2 (q+G)-(q+G') p(G-G') 



|q+G| 
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P(0) 



(23) 



COS0GG'(q) 

|AoG'(q)l e^^-'(-) = , ^H^""^ (25) 

OGG'-ecG/lq; 0) 

Here, p is the electron charge density in reciprocal space and = 47rp(0)e^/m 
is the classical plasma frequency. In this case, the integral over energy that 
is necessary in the full-frequency expression, Eq. |20l is reduced to a single 
term using an analytical approximation to the frequency dependence of the 
dielectric matrix requiring only the static dielectric matrix Cqq, (q; 0) in Eq. 
[2H The analytical approximation is done using the /-sum rule for each GG' 
pair as described in Ref. This reduces the computational cost of evalu- 
ating the S matrix elements by a factor of the number of frequencies. It is 
important to note that for systems without inversion symmetry, p in Eq. [23] 
and Vxc in Eqs. [T7| and [TS] are complex functions in reciprocal space (even 
though these are real functions when transformed to real space). For systems 
with inversion symmetry, f2^j-,,(q) and oj^^gjX^) TeaA, 0GG'(q) = or tt 
and Eqs. [21] - [25] reduce to a simpler form pj . 

In computing the sums in Eqs. [2T] and [22] we drop terms in certain 
circumstances to save time and improve numerical precision. We neglect the 

terms for which |5gg'— eGG'(qi 0) I ' I^GG'lq)! oi' |cos0GG'(q)l are less than a 
given tolerance, since these terms have a vanishing contribution to the matrix 
elements of the self energy. This avoids ill-conditioned limits due to some 
of the intermediate quantities here being undefined. Another case is when 
for an occupied state n", E — -En'/k-q— ^gg'{^) ~ 0, in which case the GPP 
factors in Ssx and Sch each diverge, although the sum 



2^ 

2wgg' {E - E„//k-q+ Wgg' (q ) ) 



. , ^GG'(q) (i-^ta'^<^GG'(q)) 

^GG' + 71=, p; — : } — rr i^Dj 



remains finite. In this situation, we do not calculate these terms in Ssx and 
ScH separately, but assign the sum of the contributions to Ssx- When n" is 
unoccupied there is only a Sch contribution which diverges. Similarly, there 
are divergent contributions to Ssx when E — En"k^q+ uJcdq,) ~ 0. In the 
full-frequency integrals in Eqs. [19] and [20], we can see that the contributions 
around a pole of in this case vanish, so the correct analytic limit of 
these terms is zero |l[. 
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For static COHSEX calculations, the expressions used in the code are: 

G)M„»„,(k,-q,-G') (27) 



and 

(nk| Sch(O) |n'k) = ^n"ni^^ -q> -G)M„.„,(k, -q, -G') (28) 

n" qGG' 

X KG'(q;o)-<^GG']t^(q+G') 

= IY1 q = 0, G' - G) [e^'G,(q; 0) - 5gg'] t;(q+G') 

qGG' 

(29) 

where Eqs. [27] and [28] can be derived formally from Eqs. [2T] and [22] by 
setting {E — -Eri"k-q) to zero. Using the completeness relation for the sum 
over empty states, Eq. [28] can be written in a closed form given by Eq. [291 
which now does not involve the empty orbitals. 

For Hartree-Fock calculations, we compute the matrix elements of bare 
exchange: 

occ 

(nk| Sx \n'k) = -J2Yl ^n"niK -q, -G)M„.„,(k, -q, -G')5GG'^(q+G') 

n" qGG' 

(30) 

In principle, the inner and outer orbitals used in Eqs. [TH] - [30] originate 
from the same mean-field solution. However, there is an option in the sigma 
executable to use a different mean-field solution for the inner and outer states. 
This is useful if one wishes to construct the S operator within one mean field 
but expand the S matrix using different orbitals, i.e., in order to evaluate 
matrix elements in a different basis than the mean-field wavefunctions when 
the quasiparticle wavefunctions are significantly different. This is also useful 
for verifying the accuracy of the linearization approximation as given by Eq. 
[3T] below. 

Eq. [18] depends on the evaluation energy parameter E. This parameter 
should be the quasiparticle energy E^^ , determined self-consistently. In 



occ 

(nk| Esx(O) \n'k) = -J2Yl ^n"n(k. -q> " 

n" qGG' 

xe„l,,(q;0)t;(q+G') 
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principle, what one may do is start by setting E = and find E'^^ using 
Eq. dSl One can then set E = E'^y. and solve Eq. [TSl arriving at a new 
quasiparticle energy -E^^- O'^^ then repeat this process until convergence 
is reached. This process can be achieved using the different set of inner 
and outer states as described in the previous paragraph - where the outer- 
state eigenvalues are updated after each step, and the eigenf unctions are left 
unchanged. In many cases, one can avoid this process by computing on 
a grid of energies and interpolating or extrapolating to E^^ . In particular, 
in many systems, S(-E') is a nearly linear function of E so one may compute 
E(£') for two grid points and evaluate the self-consistent E^^ using Newton's 
method [ij: 

QP _ , dJ:/dE 

= + r^rfsTZe^^"'^ ~ > ^^^^ 

The derivative that appears here is also related to the quasiparticle renor- 
malization factor: 

^ = l-dE/dE ^^^^ 

For full-frequency calculations, Eq. [19] and Eq. [20] are evaluated on a 
frequency grid, E, (not to be confused with the frequency grid over which 
the integrals are carried out) specified by the user. One then has access 
directly to Re S(a;) and to Im printed in the file spectrum. dat, which 

can be used to construct the spectral function: 

Ai^itu) = -■ (33) 
vr 



E 



llm S„k(w) 



[oo - E^^ - Re S„k(u;) + V^^y + |Im S„k(u;)r 



where we are using the mean- field exchange-correlation matrix element V^J^ = 
(nk |\4c| nk). This quantity can be used to compare directly with the quasi- 
particle spectrum from photo-emission experiments and various other mea- 
surements of the band- structure. 

The plane- wave matrix elements required in Eqs. [19] - [30] are similar to 
those of Eq. [8] required for the construction of the irreducible polarizability 
matrix. In the current case, however, we require additional matrix elements 
between valence-valence band pairs as well as conduction-conduction band 
pairs. As was the case in the epsilon executable, the matrix elements are 
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computed using FFTs utilizing the FFTW library [29[. For each pair of outer 
states, n and n', we sum over all occupied and unoccupied inner states, n", 
included in the calculation (typically states of energy up to the dielectric 
energy cutoff). Therefore, the computational cost of computing all the nec- 
essary matrix elements scales as N"^ log N , where N is the number of atoms 
(a factor of A^logA^ comes from the FFTs). If one is interested in all the 
diagonal matrix elements, Eq. [THl in a given energy range (as opposed to 
just a fixed small number of states - e.g. VBM and CBM) then an additional 
factor of N is included in the scaling which becomes log A^. If one requires 
both diagonal and off-diagonal elements within a given energy window (such 
as in a self-consistent GW scheme), then the scaling becomes A^^logA^. 

Once the plane-wave matrix elements have been computed, the summa- 
tions in the Coulomb-hole terms of Eqs. [19] - [30] for a particular ra, n' pair 
scale individually as N^. Again, if all diagonal or off-diagonal matrix ele- 
ments of S in a given energy window must be computed an additional factor 
of or A^^ respectively is added to the scaling. 

It is important to point out that the Coulomb-hole summations in terms 
of Eqs. [in] - [SU] converge exceptionally slowly with respect to the number of 
empty states included in the sums. The highest empty state included should 
have energy of at least the dielectric energy cutoff. Additionally, the conver- 
gence of the sums in [19] - [30] should be tested with respect to the dielectric 
energy cutoff. As shown in Figure [HI the convergence with respect to the 
dielectric energy cutoff, and the corresponding number of empty states, is 
very slow in many cases. This problem is similar to convergence issues with 
respect to empty states in the epsilon executable, as was discussed above. 
However, one finds that in many cases (particularly for bulk systems) the 
final E'^^ converges much more slowly with respect to the number of empty 
states in the Coulomb-hole expression than in the polarizability expression 



34| - see Fig. [8] for a comparison of these two rates in ZnO when using 
the Hybertsen and Louie GPP model. The partial sums of the Coulomb-hole 
matrix elements with respect to number of states included in the sum is writ- 
ten to the file ch_converge.dat. Example output from ch_converge.dat is 
plotted in Fig. [HI 

4.4- Optical Properties: BSE 

The optical properties of materials are computed in the Bethe-Salpeter 
equation (BSE) executables. Here the eigenvalue equation represented by the 
BSE, Eq. [31 is constructed and diagonalized yielding the excitation energies 
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Figure 8: ZnO Convergence of the VBM within the Hybertsen and Louie GPP modeL 
(Left panel) Example output from file ch_convergence . dat showing the Coulomb-hole 
sum value vs. the number of bands included in the sum. (Right panel) The convergence 
of Eqp with respect to empty states in the polarizability sum, Eq. [71 and with respect to 
empty states in the Coulomb-hole sum, Eq. [221 The red curve shows the VBM Eqp in 
ZnO using a fixed 3,000 bands in the Coulomb-hole summation and varying the number 
of bands included in the polarizability summation. The black curve shows the VBM £'qp 
in ZnO using a fixed 1,000 bands in the polarizability summation and varying the number 
of bands included in the Coulomb-hole summation. (For interpretation of the references 
to color in this figure legend, the reader is referred to the web version of this article.) 
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and wavefunctions of the correlated electron-hole excited states. There are 
two main executables: kernel and absorption. In the former, the electron- 
hole interaction kernel is constructed on a coarse k-point grid, and in the 
latter the kernel is (optionally) interpolated to a fine k-point grid and diag- 
onalized. 

The kernel executable constructs the second term of the left-hand side 
of Eq. [3] which is referred to as the electron-hole kernel. The kernel, K, 
as implemented in the package, is limited to the static approximation, and 
contains two terms, a screened direct interaction and a bare exchange inter- 
action, K^^ = + K^, defined in the following way jsj: 

{vck\K'^\v'c'k') = (34) 

- J rfrrfrV:k(r)V'c'k'(r)W^(r,r')^:,k'(r')^.k(r') 

and 

{vck\K''\v'c'k') = (35) 
</r*V4(r)</..(r).(r.rO^.:,.(r')^V.(rO. 



These matrices are constructed on a coarse grid of k-points, in most cases the 
same grid used within the GW calculation because one must have previously 
constructed the dielectric matrix e~^(q) for q = k — k'. We calculate these 
matrices in G-space using the prescription of Rohlfing and Louie 

{vck\K'^\v'c'k') = (36) 
^M,%(k, q, G)W^GG'(q; 0)M,,„(k, q, G') 



GG' 



and 



{vck\K''\v'c'k') = (37) 
J2KciK q, G)v{q + G)M,v(k, q, G) 



where M is defined in Eq. |8]and calculated using FFTs as described above 
in Sec. 

For each k and k', we must therefore calculate all the matrix elements 
Myy/, Mcc'i and M^^. The number of valence and conduction bands required 
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to calculate the absorption spectrum within a given energy window each 
scales linearly with the number of atoms A^. So, formally we again have 
A^^ log scaling with the use of FFTs. The summations involved in Eq. 
[36] and Eq. [371 however, formally scale as A^^ since there are A^^ terms to 
compute and each involves a sum over GG' that may be done sequentially, 
first G' and then G. In practice, though, except for the largest systems 
considered, the summations require less time than the matrix elements, and 

and Nc remain small compared to the values required in the GW step, for 
example where states with energy up to the dielectric cutoff were required. 
Usually the energy window used in solving the BSE is approximately 10 eV, 
giving a spectrum converged beyond the visible region. As we discuss below, 
within the BerkeleyGW package, the parallel wall-time scales as A^ for this 
step. However, the A^^ scaling will present a considerable challenge when 
applying the code to systems of size greater than 100s of atoms. 

As was the case for GW code, the q — )■ limit must be handled carefully 
and differently depending on the type of screening in the system. For the 
exchange kernel, we zero out all G = G' = contributions to the kernel 



matrix elements, as discussed in Ref. |40| which gives directly Im eM where 



eM is the macroscopic dielectric constant. For the direct term, however, we 
must handle the G = case specially. For these purposes, the G = 
and G' = terms are removed from Eq. [36] and treated separately. For 
each (kcv, k'c't;') we save three terms: the body term, which contains the 
result of the sum in Eq. [2S] with the G = or G' = terms removed; the 
wing term, which contains all the sum of all the remaining terms in the sum 
with the exception of the single term where G = G' = 0; the head term, 
which contains the remaining term from the sum where G = G' = 0. For 
metallic systems, as we discussed above, e~"'^(q, G = G' = 0) oc l/f(q) so 
that W{ci, G = G' = 0) oc C, thus the head term in the kernel remains well 
behaved. For semiconductors however, W{q,G = G' = 0) oc 1/g^ so that 
the head term in the kernel actually diverges as 1/g^ when q — > 0. Similarly, 
the wing term diverges as 1/g when q — )■ for semiconductors, while it again 
remains well behaved for metals. These limits are summarized in Table [2] 

Because exciton binding energies and absorption spectra depend sensi- 
tively on quantities like the joint density of states, it is essential in periodic 
systems to sample the k-points on a very fine grid. Directly calculating the 
kernel on this fine grid in the kernel executable would be prohibitively ex- 
pensive, so instead we interpolate the kernel in the absorption executable 
before diagonalization. For semiconductors, the head and wing kernel terms 
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are not smooth functions of k and k' (as we have shown above, they di- 
verge for q = k — k' — i- 0). Therefore, the quantities that we interpolate 
are ■ K^^, q ■ -ft'^jng and the body term directly as they are now smooth 
quantities |8|. For metals, we interpolate directly the kernel without any 
caveats because all the contributing terms are smooth functions of k and k'. 
As in GW, we treat metals with zero-temperature occupations. 

The absorption executable requires both coarse- and fine-grid wavefunc- 
tions as input. The interpolation is done through a simple expansion of the 
fine-grid wavefunction in terms of nearest coarse-grid wavefunction: 

«nkfi = 5ZC^:°,M„'k.„ (38) 

n' 

where kco is the closest coarse-grid point to the fine-grid point, kg, and the 
coefficients C^\, are defined as the overlaps between the coarse-grid and 
fine-grid wavefunctions: 

C^n' = I drun^,iv)u:,^jr). (39) 

The coefficients C^™/ are normalized so that IC^""/]^ = 1. It should be 
noted that for a given set of fine bands one can improve the interpolation 
systematically by including more valence and conduction bands in the coarse 
grid due to the completeness of the Hilbert space at each k. It should also 
be noted that we do restrict n and n' to be either both valence or both 
conduction bands - this is acceptable due to the different character of the 
conduction and valence bands in most systems. 

Using these coefficients we interpolate the kernel with the following for- 
mula: 

{vckfi\K\v'cVf^) = (40) 

E ^c'n° QnTC^:''^;C^.tn4(^2nikeo|K|n4n3k^o) 

where K is one of the head, wing, body or exchange kernel terms. As in 
the case of epsilon, this summation can be performed compactly as a set of 
matrix-matrix multiplications. We utilize the Level 3 BLAS calls DGEMM and 
ZGEMM to optimize the performance. 

One can improve on the interpolation systematically by using the closest 
four coarse-grid points to each fine point and using a linear interpolation 
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layer in addition to the wavefunction-based interpolation described above. 
This is done by default for the interpolation of the first term of Eq. [3] for the 
quasiparticle self-energy corrections E^^ — E^^: 

i^r(kfi) = (41) 

\ n' 

where the brackets indicate linear interpolation using the tetrahedron method. 
In this case, the wavefunction-based interpolation layer guarantees that the 
band crossings are properly handled, and the linear interpolation layer en- 
sures that we correctly capture the energy dependence of the self-energy 
corrections. In this way, we can construct E^^ on the fine grid, or any arbi- 
trary point, given E^^ on the fine grid and E^^ and E^^ on the coarse grid 

(Fig. ED. 

As an alternative to calculating the quasiparticle corrections on the coarse 
grid and interpolating them to the fine grid, the user may choose a less refined 
method of specifying the corrections using a three-parameter model involving 
a scissor-shift parameter AE to open the energy gap at the Fermi energy, a 
zero energy Eq (typically the band edge), and an energy-scaling parameter C 
changing the bandwidth (the parameters are specified separately for valence 
and conduction bands): 

^QP = ^MF ^ ^ ^ ^^MF _ ^ ^42) 

Having constructed the kernel on the fine grid, we now consider the di- 
agonalization of the kernel. The kernel matrix is of dimension Nc ■ N^; ■ Nk 
where is the number of k-points on the fine grid. Formally, the kernel 
dimension scales as for periodic systems with small unit cells and A^^ for 
large systems, where A^ is the number of atoms. For bulk systems with small 
unit cells, oc 1/A^, but the reduction with increased cell size saturates 
quickly for large systems and large molecules where A^^ = 0(1) (to compute 
a smooth continuum absorption onset, it is necessary to include some level 
of k-point sampling even for isolated systems). The matrix can be diago- 
nalized exactly within LAPACK (zheevx) or ScaLAPACK (pzheevx). The 
diagonalization therefore scales as A^^ for periodic systems with small unit 
cells and A^^ for large systems. 

The result of the diagonalization is the set of exciton eigenvalues fl^ and 
eigenf unctions A^^^, which can be used to construct the joint density of states 
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Figure 9: Top: GW quasiparticle self-energy corrections, E'^^ _£'LDA ^^^^ LDA energy 
for (10,0) SWCNT. Both a rigid opening of the band gap and a non-linear energy scaling 
are present. Bottom: The fine-grid quasiparticle band-structure using the interpolated 
self-energy corrections (black open) and the LDA uninterpolated band-structure (red). 
256 points are used to sample the Brillouin zone in the Coulomb-hole summation. (For 
interpretation of the references to color in this figure legend, the reader is referred to the 
web version of this article.) 
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(JDOS) or the absorption spectrum (or Im £2 (w)) using Eq. El There are 
a number of post-processing tools in the package, such as PlotXct, which 
plots the exciton wavefunction in real space according to Eq. |l]to analyze the 
exciton states. The absorption executable can produce both the singlet and 
triplet eigenvalues and eigenvectors. In the latter case, the exchange term 
is set to zero when diagonalizing the kernel j^. (Note that for triplets the 
oscillator strengths calculated in the code are evaluated without considering 
spin overlap. In some cases this "oscillator strength of the corresponding 
singlet" may be useful. The true physical oscillator strength of course is zero 
for triplets.) Additionally, one may compute the eigenvalues and eigenvectors 
with only the exchange interaction; the resulting spectrum should be the 



same as the one obtained within RPA with local-field effects included [40 . 

The scaling for large systems in the diagonalization is, in practice, 
much more limiting than the A^^ step in the construction of the kernel. This 
is because the latter step can be parallelized very efficiently, while diagonal- 
ization, even with the use of ScaLAPACK, typically saturates at 0(1000) 
CPUs. Often one is only interested in the absorption spectrum or the JDOS, 
and not all of the correlated exciton eigenfunctions and eigenvalues. For 



such systems, we use the Haydock recursion method |42|. This is an 
iterative method based on spectral decomposition and requires only matrix- 
vector products, which can be parallelized efficiently. This method gives the 
absorption spectrum directly, the equivalent of Eq. O In principle, one can 
get eigenvalues and eigenvectors for a small energy range of interest using 



iterative Lanczos algorithms [42 



As mentioned above, the electron-hole kernel should be constructed with 
a sufficient number of valence and conduction bands to cover the energy win- 
dow of interest - typically all bands within the desired energy window from 
the Fermi energy should be included so that the energy window of the bands 
included in the calculation is at least twice that of the desired absorption 
energy window. The absorption executable computes the percent deviation 
from the /-sum rule 

e2{u)ujdu= ^. (43) 

One should converge this quantity with both the number of valence and 
conduction bands included. The absorption spectrum (or €2) in the energy 
window of interest converges much more quickly than ei if high-energy tran- 
sitions outside of the window of interest contribute greatly to the sum rule. 
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since ei^u) is related to an integration over all frequencies of €2(00) via the 
Kramers-Kronig relation. 

Finally, the transition matrix elements (OlvIS*) in Eq. |5]are printed in the 
file eigenvalues.dat. These are related to the oscillator strengths fs by 

/. ^ (44) 

We compute the velocity matrix element via the commutator of the many- 
body Hamiltonian, as follows [43| : 

(0 |v| S) = (0 \t [H, r]\S)=t (Eo - Es) (0 |r| S) 

= -tn'J2^vc^.{v^\r\ck) (45) 

In a periodic system, we cannot calculate matrix elements of the position 
operator, but we can use a q — )■ limit jsj: 

/ 1 , , , N ^■ (^^k + q|e^'i-'-- l|ck) 
(vk r ck) = hm 

q^o iq 

In practice, we evaluate the limit using finite differences for a small value 
of q, similarly to how the limit is treated in the Epsilon code. Thus the 
valence bands on a shifted fine k-grid are required. (Note we are assuming 
an interband transition.) 

As an alternative to the finite-difference approach, one may approximate 
the velocity operator by the momentum operator — iV, avoiding the calcula- 
tion of the valence bands on the shifted fine grid. We reverse the derivation 
partly: 

{vk I [H^'', r] I ck) _ _ . (t;k|v^^|ck) _ (t;k|V|ck) 
(f k |r| ck; — — i j^p j^p ^ j^p ^^p [Al) 

^vk - ^ck Kk - ^ck ^vk - ^ck 

This does not require an additional grid, but yields inexact oscillator strengths 
due to neglect of commutators between r and the non-local part of the Hamil- 
tonian [sl, |4J]. We could have approximated the quasiparticle or excitonic 



velocity operator by the momentum operator, but this would be less accu- 
rate, since those Hamiltonians have additional sources of non-locality beyond 
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those of the mean-field Hamiltonian. The momentum operator uses transi- 
tions fk — > ck in the Bethe-Salpeter equation (Eq. [3D, whereas with the 
velocity operator we actually use the transitions i;k + q — )■ ck. This en- 
sures consistency of wavefunctions between excitons and transition matrix 
elements, and also is needed to describe transverse and longitudinal excitons 
in materials with an indirect gap jsj. 

5. Parallelization and Performance 
5.1. epsilon 

The parallelization of epsilon is characterized by two distinct schemes 
for the two main sections of the code: (1) the computation of matrix elements 
(Eq. |8]), and (2) the matrix multiplication (Eq. [T^ and inversion. 

For the computation of the matrix elements in Eq. |8l the code is par- 
allelized with nearly linear scaling up to ■ Nc processors, where and 
Nc are the number of valence and conduction bands respectively used in the 
sum of Eq. [71 Each processor owns an approximately equal fraction of the 
total number of {v, c) pairs for all k, and performs serial FFTs to compute 
the matrix elements, Eq. [HI for all G and k associated with the pair. Note 
that for large systems, Ny is on the order of 100s and Nc is on the order 
of 1000s or more, so that this section of the code scales well up to 100,000 
CPUs. 

All wavefunctions are stored in memory unless the optional coiniii_disk flag 
is given. Each processor holds in memory the wavefunctions for all the pairs it 
owns. If comin_disk is specified (as opposed to the default comm_mpi option), 
the distribution of pairs is the same, but each processor saves the conduction 
wavefunctions it needs on disk and reads the wavefunctions back into memory 
one pair at time for the purposes of computation. Using comm_disk can 
therefore reduce the amount of memory required for the computation, but 
comes with a substantial performance reduction. 

The processors are distributed into valence and conduction band pools in 
order to minimize the memory required using a complete search algorithm. 
For example, if one calculates a material with few valence and many conduc- 
tion bands, all the processors will be in one or two valence pools (holding all 
or half the valence bands in memory each) but spread over a large number of 
conduction pools because the relative cost of holding all the valence bands in 
memory is much smaller than holding all the conduction bands in memory. 
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In such a scheme, the amount of memory required per processor drops hn- 
early with small numbers of processors and decreases as 1/ a/ Npj-oc for large 
numbers of processors (Fig. [T0|) . 

In the second section of the epsilon code, we switch from a parallelization 
over bands to a parallelization scheme over GG' for the polarizability (Eq. [7]) 
and dielectric matrices (Eq. We use the ScaLAPACK block-cyclic layout 



45[ in anticipation of utilizing the ScaLAPACK libraries for the inversion 
of the dielectric matrix. The transition between the band distribution of 
the matrix elements in Eq. [8] and the block-cyclic layout of the polarizability 
matrix is achieved naturally in the process of doing the parallel matrix-matrix 
multiplication involved in Eq. [TH There is, however, a significant amount 
of communication involved at this step. To minimize this communication we 
have two options for the parallel multiplication. 

In the first scheme, corresponding to the use of the gcoimmnatrix flag in 
epsilon. inp, we loop over processors (for simplicity, we label the loop index 
i) who own a piece of the polarizability matrix x(G, G'). Each processor does 
the fraction of the matrix multiplication in Eq. [13] relating to the (n, n') 
pairs it owns and for submatrix x(Gi, G-) that the ith processor stores. The 
processors then MPI-reduce their contribution to the ith processor. Thus 
the total communication in this scheme is an eventual reduction of the entire 
x(G, G') to the processors that store it. It is important to note that we must 
do this one processor at time (or at most in chunks of processors - chosen 
often as the number of CPUs per node) because no single processor can hold 
in memory the entire x(G, G') matrix for large systems. 

In the second scheme, corresponding to the use of the gcoinm_elements 
flag in epsilon. inp, we again loop over processors, but this time, we have 
the ith processor MPI-broadcast to all processors that hold a piece of the 
polarizability matrix the set of matrix elements for all the {v, c) pairs it owns. 
Each processor then uses these matrix elements to compute the contribution 
of the matrix-matrix product, Eq. [TU for the submatrix of x(G, G') it stores. 
In this scheme, all the matrix elements (Eq. |8]) are eventually broadcast. 

Whether the use of gcoinm_elements or gconim_matrix is optimal depends 
on whether it is faster to reduce the x(G, G'; w) or broadcast all the matrix 
elements, M„„'(k, q, {G}). In particular if A^g ■ Afreq < N.^ ■ Nc ■ Nk, where 
Afreq is the uumbcr of frequencies in a full frequency calculation, then it is 
cheaper to use gcommjaatrix. If no flag is specified, the epsilon code will 
make this choice for the user based on the above criteria. 

Because we use the block-cyclic layout, the memory required to store the 
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Figure 10: The memory required per CPU vs. the number of CPUs used for a epsilon 
calculation on the (20,20) nanotube. See text for parameters used. 

x(G, G') decreases linearly with the number of CPUs. However, the cost of 
the inversion utilizing ScaLAPACK can saturate at 100s of CPUs and the 
cost of the summation can saturate with a few thousand CPUs, see Figure 
[TTl In general, the number of CPUs used for the block-cyclic distribution of 
X can be tuned. 

Beyond the more sophisticated level of parallelization described above, 
there is a more trivial level of parallelization available to small systems re- 
quiring large numbers of k-points: Eq. [7] is completely separable as a function 
of q. One may run a separate epsilon calculation for each q required and 
merge the dielectric matrices - in such a way, a user can obtain perfectly 
linear artificial scaling with CPUs to A^^ times the number CPUs mentioned 
above. 

The scaling of memory and computation time with respect to the num- 
ber of CPUs used per q-point in epsilon for the example (20,20) SWCNT 
calculation is shown in Fig. [10] and Fig [TTl We find nearly linear scaling up 
to 3200 CPUs per q-point. Since there are 32 q-points in this calculation 
that are trivially parallelized, we find nearly linear scaling of the epsilon 
computation up to ~100,000 CPUs. 

5.2. Sigma 

Within a sigma calculation, one computes a requested number of diag- 
onal, Eq. [iHl or off-diagonal, Eq. [T71 E matrix elements. For each matrix 
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Number of CPUs per q-point 

Figure 11: The wall-time required vs. the number of CPUs per q-point used for a epsilon 
calculation on the (20,20) single-walled carbon nanotube. There is near linear scaling 
up to 1,600 CPUs. Since there is an additional layer of trivial parallelization over the 
32 q-points required, the epsilon calculation scales to over 50,000 CPUs. See text for 
parameters used. 

element there are two computationally intensive steps. The first is to cal- 
culate all the plane- wave matrix elements Mnn" and Mn'n", Eq- 13 for the 
outer states of interest, n and n', and for all occupied and empty states, n". 
Secondly, we compute the sum over states, n", as well as G, G' and q in the 
expressions in Eqs. fT9]-l30l 

The Sigma execution is parallelized over both outer bands, n and n', and 
inner bands, n". As in the case of epsilon, this is done by defining pools and 
distributing the n, n' pairs evenly among the pools. We then distribute the 
n" bands evenly within the pools. As was the case for epsilon, we define the 
number of pools using a complete search algorithm to minimize the amount 
of memory per CPU required to store the inner and outer wavefunctions. 

As described above, the CPU time required for the computation of all 
plane- wave matrix elements, Mnn" and Mn'n"-, scales as A^^logA^, where N 
is the number of atoms, for each S matrix element of interest. As described 
above, the outer-state pairs are parallelized over pools and the inner states 
are parallelized over the CPUs within each pool. The wall-time for the com- 
putation of all the plane-wave matrix elements required for every E matrix 
element scales as A^logA^ with unlimited CPU resources. As was the case 
in the epsilon executable, each CPU computes the plane-wave matrix el- 
ements between all the (n, n") and (n', n") pairs it owns for all G through 
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Figure 12: The memory required per CPU vs. the number of CPUs used for a sigma 
calculation on the (20,20) nanotube. See text for parameters used. 



serial FFTs using FFTW |29 . 



The summations required in Eqs. [19] - [30] are parallelized by again dis- 
tributing the outer-state pairs over the pools and then distributing the inner 
states over the CPUs within each pool. The wall-time for the summations, 
therefore, scales as N'^ (for the sums over G and G') regardless of the num- 
ber of diagonal or off-diagonal elements requested, given unlimited CPU re- 
sources. 

As was the case for epsilon, the wavefunctions are distributed in memory, 
with each CPU owning only the n, n' and n" wavefunctions that it needs for 
the computations described above. The dielectric matrix, Cq^q, (q; E) for 
each q and i?, is distributed globally over the matrix rows, G. 

The scaling of memory and computation time with respect to the num- 
ber of CPUs used per k-point in sigma for the example (20,20) SWCNT 
calculation is shown in Fig. [12] and Fig [13] We find nearly linear scaling up 
to 1600 CPUs per k-point. Since there are 16 irreducible k-points in this 
calculation that are trivially parallelized, we find nearly linear scaling of the 
sigma computation up to 25,000 CPUs. 

5.3. BSE 

As mentioned in the previous sections, in the kernel executable, for 
each k and k', we must calculate all the matrix elements M„„i, Mcc', and 
M^c and then perform the summations involved Eq. [3S] and Eq. [37] for 
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Figure 13: The wall-time required vs. the number of CPUs per k-point used for a sigma 
calculation on the (20,20) single-walled carbon nanotube. There is near linear scaling up 
to 1,920 CPUs. Since there is an additional layer of trivial parallelization over the 16 k- 
points required, the sigma calculation scales to over 30,000 CPUs. See text for parameters 
used. 

each (f ck, v'dk.') pair. BerkeleyGW automatically parallelizes this in different 
schemes depending on the system and number of CPUs provided. 

If the number of CPUs is less than iV|, the square of the number of coarse 
k-points, we distribute the (k, k') pairs evenly over the CPUs and each CPU 
calculates all the matrix elements, M„„/, Mcc', and My^, required for the k- 
point pairs it owns through serial FFTs. It then computes the sums in Eq. 
|36] and Eq. |37] for all of its pairs. 

If the number of CPUs is greater than A'^^, as is often the case for large 
systems and molecules, but less than N^-N^, we distribute the (ck, c'k') pairs 
evenly among the processors, first distributing the processors evenly over k- 
point pairs and then creating pools to distribute the (c, c') evenly among the 
pools. In this scheme, each CPU computes Mcc' for only the (ck, c'k') it owns 
but computes M^^i and M^c for all v and v' at each (ck, c'k') pairs it owns. 
Each CPU does the summations in Eq. |36] and Eq. |37] for all v, v' for the 
(ck, c'k') it owns. 

If the number of CPUs is greater than N^. ■ , we distribute the entire set 
of [vck, f 'c'k') pairs out evenly among the processors, first distributing the 
processors evenly over (ck, c'k') pairs and then creating pools to distribute 
the (f , f ') evenly among the pools. In this scheme, each CPU computes only 
the Myy/, Mcc', and M^c for the (fck, f'c'k') pairs it owns. Additionally, each 
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Figure 14: (Left) Memory per CPU required vs. the number of CPUs for a kernel 
calculation on the (20,20) SWCNT. (Right) The wall-time required vs. the number of 
CPUs used for a kernel calculation on the (20,20) SWCNT. The parameters used are 
described in the text. 

CPU does the summations in Eq. [36] and Eq. [37] for only the pairs it owns. 

In the last scheme the calculation of the matrix elements has a parallel 
wall-time scaling of and the summations scale as A^^ (accounting for the 
sum over GG') if a limitless number of CPU resources is assumed. 

The large arrays that must be stored in memory are the dielectric matrix, 
the wavefunctions and the computed kernel itself. The computed kernel is 
distributed evenly in memory among the processors and computed directly 
by the CPUs which own the various {vck, v'c'k.') pairs. The dielectric matrix 
is distributed, as in the sigma executable, over its rows G and must be 
broadcast during each calculation of the sums in Eq. [36] and Eq. [37] It is 
for the purposes of minimizing the communication of the dielectric matrix 
that we use the three different parallelization schemes above - i.e. so that we 
might work on the biggest blocks of {v,v') and (c, c') at once. For example, 
in the first scheme, where the number of CPUs is less than iV|, we do the 
sums in Eq. [3S] and Eq. [33 for all (f c, v'c') at once, so that we need only 
broadcast the dielectric matrix one time. 

The wall-time scaling for the example kernel (20,20) SWCNT calculation 
is shown in Fig. [TJ] We see nearly linear scaling up to 1024 CPUs - which 
is the square of the number of k-points, 32. 

In the absorption executable, as described above, the first computational 
challenge is the interpolation of the kernel from the coarse k-grid onto the 
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fine k-grid. The computation of tlie interpolation coefficients, Eq. [321 is 
done by distributing the fine-grid k-points evenly among the processors. For 
molecules or other large systems, this does not represent a problem because 
the computation of the coefficients is very quick in these cases regardless of 
the lack of parallelization. 

The parallelization of the kernel interpolation is different from the paral- 
lelization scheme in the kernel executable. However, the parallelization is 
also described by three different schemes: 

First, if the number of CPUs is less than N^, where Nk is the number 
of fine-grid k-points, we distribute the Nk k-vectors on the fine grid evenly 
among the CPUs. Each processor owns all the {vck, v'c'k.') pairs consistent 
with the k-vectors it was assigned. It then performs the interpolation in Eq. 
|40] serially by replacing the simple loops that represent the sums with four 
matrix-matrix multiplications. For example, for each rii and ria, one can 
write the sum over n2 as a matrix-matrix product between the coarse kernel 
matrix (whose outer dimension is n^) and the [C^^j] matrix (whose outer 
dimension is iV„ on the fine grid). We write the remaining sums as a similar 
matrix- matrix product. 

If the number of CPUs is greater than Nk but less than Nk ■ Nc, we 
distribute the Nk ■ Nc k-point and conduction-band pairs evenly among the 
processors. Again, each processor owns all the {vck, v'c'k.') pairs consistent 
with the (k, c) it was assigned. In the previous scheme, we utilized matrix- 
matrix products to interpolate many kernel elements at once. This required 
the allocation of an array of size iV^ ■ N^ as described above. In the present 
scheme, we avoid storing large intermediate arrays by doing more of the sums 
in Eq. HO] as simple loops rather than matrix products. By default we do 
two simple loops of two matrix products. However, if the user selects the 
low_memory option, we do all four summations as four nested loops with- 
out the aid of matrix-matrix multiplications, obtaining one fine-grid matrix 
element at a time without the need for any intermediate matrices. 

If the number of CPUs is greater than Nk-N^, we distribute the Nk-Nc-N, 
k-point and conduction- and valence-band pairs evenly among the processors. 
Each processor owns all the {vck, v'c'k') pairs consistent with the (k, c, v) it 
was assigned. The interpolation is done exactly as in the previous case. 

Once the fine-grid kernel has been constructed, in the absorption ex- 
ecutable, the matrix is diagonalized using ScaLAPACK with a block-cyclic 
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layout [45[. This diagonalization scales well to 0(1000) CPUs but saturates 



quickly beyond this point. In order to calculate the absorption spectrum as 
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per Eq. El we compute all the necessary matrix elements, Eq. HHl by dis- 
tributing the fine-grid k-points evenly over processors. In order to diagonalize 
large matrices (e.g., graphene on a 256 x 256 k-point grid |46|), we turn to 



iterative methods, in particular Haydock recursion [41|, |42| . Since it requires 
only matrix-vector products, this method scales well to larger numbers of 
processors. It should be pointed out that the kernel matrix is not sparse in 
general, so methods designed for the diagonalization of sparse matrices are 
not appropriate here. 

6. Coulomb Interaction 

The bare Coulomb interaction is used in many places throughout the code. 
In all cases, the ID array f (q+G) is generated from a single vcoul_generator 
routine in the Common directory. However, there is a lot that can be specified 
about the Coulomb interaction in the code. The Coulomb interaction can be 
truncated to eliminate the spurious interaction between periodic images of 
nano-systems in a super-cell calculation. One can implement a cell- averaging 
technique whereby the value of the interaction at each q-point (or q — )■ 
in particular) can be replaced by the average of f (q + G) in the volume 
the q-point represents. Finally, this average can be made to include the q- 
dependence of the inverse dielectric function also if W is the final quantity 
of relevance for the application - such as in the evaluation of W for the self 
energy. 

The BerkeleyGW package contains five general choices for the Coulomb 
interaction. Firstly, one may choose to use the bulk, untruncated value 
expressed in Eq. [TOj There are in addition 4 choices of Coulomb interaction 
that truncate the interaction beyond a certain cutoff in real space, of generic 
form 

Mr) ^ ^ (48) 

r 

where / is some function that describes the geometry in which the inter- 
action is truncated. The four choices available implement the methods of 
Ismail-Beigi ^7|: Wigner-Seitz slab truncation, Wigner-Seitz wire trunca- 
tion, Wigner-Seitz box truncation, and spherical truncation. The Wigner- 
Seitz box truncation and the spherical truncation truncate the Coulomb in- 
teraction in all three spatial directions, yielding a finite value of ft(q = 0). 
All the Wigner-Seitz truncation schemes truncate the interaction at the edges 
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of the Wigner-Seitz cell in the non-periodic directions. Slab truncation is in- 
tended for nano-systems with slab-like geometry. The Coulomb interaction 
is truncated at the edges of the unit cell in the direction (z) perpendicular to 
the slab plane (xy) . Wire truncation is intended for nano-systems with wire- 
hke geometry. The Coulomb interaction is truncated at the edges of the unit 
cell in the two directions (xy) perpendicular to the wire axis (z). Spherical 
truncation allows the user to specify manually a spherical truncation radius 
outside of which the Coulomb interaction will be truncated. 

Like the untruncated interaction, the slab-truncation and spherical-truncation 
schemes have the benefit that Vt{q + G) can be constructed analytically: 

<'(q) = ^ • (1 - cos(r, • q)) (49) 
Q 

'(q) = ^ ■ (1 - cos(?, • z,)) (50) 

where and z^. are the truncation distances in the radial and perpendicular 
directions, respectively. The wire-truncation and box-truncation on the other 
hand are computed numerically through the use of FFTs. First, the trun- 
cated interaction, {2KQ{\qz\p) for wire geometry where p — yjx^ ^y^ and 
is the modified Bessel function and 1/r for box geometry), is constructed 
on a real-space grid in the Wigner-Seitz cell and folded into the traditional 
unit cell. The real-space grid is typically more dense than the charge density 
grid used in DFT calculations. The density of points on the real-space grid 
relative to the charge density grid in xy directions for wire-truncation and in 
xyz directions for box-truncation is set to be a factor of 4 and 2, respectively. 
The origin of the Coulomb potential is offset from the origin of the coordinate 
system by half a grid step to avoid the singularity. We then FFT to yield 
the Vt(q) directly. The FFT is done in parallel. For wire-truncation, xy- 
planes are evenly distributed among processors and each processor performs 
2D-FFTs in xy-planes it owns. For box-truncation, the parallel 3D-FFT is 
performed as follows: xy-planes are distributed and 2D-FFTs in xy-planes are 
carried out the same way as for wire-truncation; the data is then transferred 
from a;|/-planes to z-rods which also are evenly distributed among processors; 
finally, each processor performs ID-FFTs in z-rods it owns. After the FFT, 
the origin of the Coulomb potential is shifted back to the origin of the co- 
ordinate system by multiplying Wt(G) with exp(i27rG • |d) where d is the 
real-space grid displacement vector. 
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As mentioned above, for all interaction choices with the exception of cell- 
box and spherical truncation, the Coulomb interaction diverges as q — )■ 0. 
For the case of no truncation, f (q — )■ 0) oc for slab truncation, t>^''^'^(q — >• 
0) oc for wire truncation, v^"'^{c[ — )■ 0) oc — ln(g). As we mentioned in 
Sec. \A.2\ this divergence is handled in epsilon by taking a numerical limit 
- that is, evaluating e at a small but finite qo- For sigma and absorption, 
on the other hand, we are interested in evaluating directly W^gg'(^) matrix 
elements and the appropriate treatment is to replace the divergent (for non- 
metals) Vr(q — )■ 0) with an average over the volume in reciprocal space that 
q = represents: 

^GG' (q = 0) = / d-q (q) (51) 

■^^ J cell 

where "cell" represents the volume in reciprocal space closer to q = than 
any other q-points, and d"'q represents the appropriate dimensional differen- 
tial for the truncation scheme (e.g., d'^q for slab truncation). Eq. |5T] yields 
a finite number in all truncation schemes even when W{q — ?■ 0) itself is 
divergent because of the reduced phase-space around q — )■ 0. 

For metallic systems, it is particularly important to use Eq. [51] to average 
W, as opposed to averaging t>(q) and e~^(q) separately, since for metals 
e~^(q) has inverse q-dependence from f (q) at small q, yielding a constant 
limit: W^^^'^^{c{ — )■ 0) = C. The user can tell the code which model to use 
for the q-dependence of by specifying one of the screening flags in the 
input files. The q — )■ limits for the inverse dielectric function and screened 
Coulomb interaction are enumerated in Table [2] for the semiconductor and 
metallic screening types. 

As shown in Fig. [T51 including in the cell- averaging scheme also 
is important when using a truncated interaction where e(q = 0) = 1 but 
quickly rises by the first non-zero q-point [43]. The figure shows e~^{q) 
along the tube axis in the (14,0) SWCNT. e~^(0) = 1 but decreases nearly 
by half by the first non-zero grid point included in a 1 x 1 x 32 sampling 
of the first Brillouin zone. e"^(0) = 1 is a general property of truncated 
systems with semiconductor-type screening since the q^ dependence of the 
polarizability approaches faster than the Coulomb interaction diverges at 
g — )■ 0. BerkeleyGW uses this ly-averaging procedure by default for cases 
with truncated Coulomb interaction. 

In general, using an extension of Eq. [51] even for q, GG' ^ can speed 
up the convergence of a sigma or absorption calculation with respect to the 
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Figure 15: £^^{q) in the (14,0) single-walled carbon nanotube for q along the tube axis as 
reported in the epsilon.log output file. The circles represent q-grid points included in 
a 1 X 1 X 32 sampling of the first Brillouin zone. 

number of q-points required in the calculation. This can be explained easily 
by the fact that one is replacing a finite sum over q-points with an integral 
- mimicking a calculation on a much larger set of q-points or a much larger 
unit cell. The user can ask that BerkeleyGW use the cell-averaged W for all 
q and G below an energy cutoff specified in the input file. 

The averaging is implemented in the code using a Monte Carlo integration 
method with 2,500,000 random points in each cell. 

6.1. Grid Uniformity 

Another important consideration in performing integrals over q of the 
Coulomb interaction in sigma and kernel is the sampling of the grid on 
which the integral is done. If the sampling in different directions is very 
different, then the result of the integrals will not go to the correct limit, 
since it will resemble a ID or 2D integration rather than a 3D integration. 
This issue is important for calculation of nano-systems without truncation, or 
for highly anisotropic crystals. We determine the effective sampling in each 
direction as follows: take the vectors bi/Ni, where bi is a reciprocal lattice 
vector and iVj is the number of q-points in that direction. Find the shortest 
vector. Orthogonalize the next shortest vector to that one. Orthogonalize 
the remaining vector to the first two. (It is important to use this order 
since orthogonalization always makes the vector shorter.) Now compare the 
lengths of these orthogonalized vectors: if the ratio between the longest and 
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Table 2: Top: q — >■ limits of the head eoo(*l)' wing eQQ(q), and wing' epQ,(q), of the 
inverse dielectric matrix. Bottom: q — >■ limits of the head and wings of the screened 
Coulomb interaction, IVgg'Iq)- 
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shortest is too large (we use a factor of 2 as a tolerance), the grid is non- 
uniform and may give incorrect answers. The code will write a warning in 
this case, and the user should try to use a more nearly uniform grid, or check 
the convergence of results against the cell-averaging cutoff. Note that the 
sampling in any direction in which the Coulomb interaction is truncated is 
irrelevant when checking for grid uniformity. 

7. Symmetry and degeneracy 

7.1. Mean field 

As was mentioned in the Introduction, the largest cost when performing a 
GW calculation with the BerkeleyGW package is the generation of the input 
mean- field states. In order to reduce this cost, all the codes allow the user 
to input the wavefunctions in only the reduced Brillouin zone and construct 
the wavefunctions in the full zone by the following relation: 

0R(k)(G) = 0k(R-'(G))e-^^-- (52) 

where the symmetry operation is defined by a reciprocal-space rotation ma- 
trix R and a fractional translation r such that x' = R^^x + r. 

The main k-grid used is generated by constructing a uniform grid (with 
a possible shift, typically half a unit as in Monkhorst-Pack grids) and then 
reducing by the symmetry operations. The shifted grid used in epsilon 
for constructing epsOmat at a small q-vector is generated by unfolding this 
reduced set of points with all the symmetry operations, reducing again by 
the symmetry operations of the subgroup of the q-vector, and then applying 
a small q-shift. (Note that, contrary to a naive expectation, this reduced 
shifted grid may contain many more points than the original uniform grid.) 
An example for graphene is given in Fig. [161 This procedure provides all the 
k-points needed for calculations in BerkeleyGW, and it is implemented in the 
utility kgrid.x in the MeanField/ESPRESSO directory, which should be used 
to generate the set of k-points in mean-field calculations. PARATEC also has 
built-in support for this construction. The symmetry analysis in kgrid.x, 
epm2bgw.x, and siesta2bgw.x is performed with the spglib library [48(]. 

7.2. Dielectric matrix 

The wavefunctions in the full zone then are used for the sum over k 
in Eq. [7] in epsilon, sigma, kernel and absorption, which requires not 
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Figure 16: An example of the construction of 4 x 4 main and shifted k-grids for graphene. 
(Left) The main grid has a (0.5, 0.5) shift (crystal coordinates). There are 16 points in 
the full Brillouin zone and 6 irreducible points. (Right) The shifted grid has a (0.0, 0.05) 
shift (crystal coordinates). There are 48 points in the full Brillouin zone and 26 irreducible 
points. 
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only the wavefunctions in the full zone but also the dielectric matrix. While 
in principle one could construct the dielectric matrices at all the q-points 
required in Eqs. [19] - [30] and [36l in practice one can use symmetry to reduce 
the required q-points. The epsilon code requires the user to calculate the 
dielectric matrices on a reduced set of q-points and the other codes generate 
the dielectric matrices in the full zone. If one defines qi = R(q) + G^, where 
Gr is a G-vector chosen to ensure that q and qi are in the first Brillouin 
zone, and R, r are rotation matrix and translation respectively, then one can 



where Gi = R ^(G + Gr). Given e ^ in the reduced zone, this allows one 
to construct it in the full zone. 

7. 3. Truncation of sums 

Both the dielectric matrix and the self-energy operator involve infinite 
sums over unoccupied states, which must in practice be truncated in the 
epsilon and sigma codes. The dielectric matrix and self-energy operator 
only retain the full symmetry of the system if the truncation does not cut 
through any degenerate subspaces. Consider a subspace of states belong- 
ing to a degenerate representation of a symmetry operation. Only the whole 
subspace is invariant under that operation, while just a part of it is not neces- 
sarily. As a result, a calculation using only part of the subspace will produce 
self energies that break degeneracies due to that operation. Moreover, the 
actual values obtained are not well defined because the states used are arbi- 
trary linear combinations in the subspace, which could even differ from run 
to run of a DFT code depending on how the calculation is initialized. These 
considerations are particularly acute for sums over a small number of states, 
since the contribution of the last few bands may be significant. Therefore, 
the epsilon and sigma codes check that the highest band requested for the 
sum is not degenerate with the next one, and block calculations that will 
break degeneracy. However, it is also possible to override this behavior with 
the flag degeneracy_check_override, for testing purposes and because in 
some cases there may be overlapping degenerate subspaces on different It- 
points that make it difficult to find acceptable numbers of bands; for large 
numbers of bands, the effect of truncation in a degenerate subspace will be 
small. 





(53) 
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H- Self- energy operator 

Degeneracy is also important from the point of view of the states on which 
self energies are calculated, as opposed to those appearing in the sum. Since 
the self-energy operator has the full symmetry of the system, the matrix 
elements between states belonging to different representations are zero by 
symmetry. In the presence of high symmetry, this consideration can make 
the matrix quite sparse. To take advantage fully of symmetry here would 
require a careful analysis of each wavefunction's behavior under various sym- 
metry operations and comparison to character tables of space groups. Users 
certainly can do this in deciding which off-diagonal self-energy matrix ele- 
ments to calculate. The Sigma takes a very simple approach to identify some 
of the elements which are zero by symmetry, based on degeneracy. The mul- 
tiplicity of the degenerate subspace to which each states belongs is counted 
(1, 2, or 3 for the standard space groups), and clearly two states in subspaces 
of different multiplicity must belong to different representations, and their 
matrix element can be set to zero without calculation. This saves time and 
enforces symmetry. 

Application of symmetry in a degenerate subspace can also speed up 
calculation of diagonal elements of the self-energy operator. The expressions 
for the exchange, screened exchange, and Coulomb-hole parts contain a sum 
over q. In general, this must be done over the whole Brillouin zone, but 
to calculate the sum of the self energies within a degenerate subspace it is 
sufficient to use the irreducible part of the Brillouin zone. Each part of E, in 
the various approximations, has the generic form 



(nk| E \n'k) |e^^^+^>""| n"k - q> (n"k - q 

n" qGG' 

X F (q, G, G') 



g-i(q+G')-r 



n'k 

(54) 



The summand is invariant under application of a symmetry operation O 
in the subgroup of k provided that n = n' and n and n" are non-degenerate, 
since in that case the action of the operation simply introduces a phase: 
O \mk) = e*^ \mk) (degenerate states may instead transform into linear com- 
binations in the degenerate subspace) . These phases are canceled by the fact 
that each state appears also with its complex conjugate. If the states n" 
in the sum are degenerate, the summand is not invariant but the sum is, if 
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the whole degenerate subspace is summed over, since then we are taking the 
trace of the projector matrix |n"k) (n"k| in that subspace, which is invari- 
ant [iij. If n is degenerate, then (?T,k| E |nk) is not invariant, but the trace 
of the self-energy in the degenerate subspace, (nk| S |nk), is invariant. 
Therefore, to calculate diagonal elements for a whole degenerate subspace, 
for each state we sum only over q in the irreducible zone, with weight Wq 
from the number of q- vectors related to q by symmetry. We then symmetrize 
by assigning the average to each: 



(m. 



^ deg 

k| E |mk) = -— {nk\ E \nk) 



dcg irr 

= - E E E E (^k |e^(''+«)-| n"k - q> (55) 

n n" GG' q 



(This averaging over degenerate bands is also done to enforce symmetry 
even when we use the full q-sum, since the results may differ slightly due 
to limited precision in the wavefunctions from the mean- field calculation.) 
If we are calculating only part of a degenerate subspace, this trick does 
not work, and we must perform the complete sum. For diagonal elements, 
the code by default uses the irreducible q-sum and will write an error if 
the calculation requires the full sum because of degeneracy, directing the 
user to enable it via the flag no_symmetries_q_grid, or include all states 
in the degenerate subspace. For off-diagonal elements (n 7^ n"), even if 
both are non-degenerate, application of the symmetry operation, in general, 
introduces different phases from the two states, which are not canceled. Thus 
the contributions from different q-points related by symmetry differ, so that 
the full sum must always be used. 

7.5. Bethe-Salpeter equation 

Degeneracy must be considered in BSE calculations as well, when choos- 
ing the subspace in which to work. If the set of occupied or unoccupied 
states includes only part of a degenerate subspace, then the solutions found 
by absorption will break symmetry and can give qualitatively incorrect re- 
sults. For example, an excitation that should have zero oscillator strength by 
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symmetry, due to interference between transitions to two degenerate states, 
may not be dark if only one of those transitions is included. This issue is 
quite general and applies to the choice of active spaces in other theories as 



well, such as configuration interaction |50|. Breaking degeneracy in either 
the coarse or fine grid can also cause trouble in interpolation of the kernel 
and quasiparticle energies. 

7.6. Degeneracy utility 

We provide a utility called degeneracy_check.x which reads wavefunc- 
tion files and writes out a list of acceptable numbers of bands. Multiple 
wavefunction files can be checked at once, for example the shifted and un- 
shifted grids in Epsilon or shifted, unshifted, coarse, and fine grids for Bethe- 
Salpeter equation calculations, in which case the utility will identify numbers 
of bands which are consistent with degeneracy for every file. 

7. 7. Real and complex flavors 

The component executables come in two "fiavors," real and complex, 
specified at compile time and denoted by the suffix .real.x or .cplx.x. 
When the system has inversion and time-reversal symmetry, we can choose 
the wavefunctions to be real in reciprocal space. The plane-wave expansions 
are: 



u[r) = Y^ Ugc'^-^ (56) 

G 

w(-r) = 5^WGe-^«-^ (57) 

G 

«*(r) = E"Ge^^'''' (58) 

G 

The symmetry conditions mean that wavefunctions can be chosen to satisfy 
u (— r) = au (r) (inversion symmetry) and u* (r) = bu (r) (time-reversal, 
equivalent to taking the complex conjugate of the Schrodinger equation), 
with a, b each equal to ±1 depending on whether the wavefunction belongs 
to an odd or even representation. Thus we can choose u (— r) = cu* (r) with 
c = ab also equal to ±1. Combining this with the plane- wave expansions, 

^«^e-G- = c^w^,e-«- (59) 



G 



ug = (60) 
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The choice c — 1 corresponds to real coefficients; c — —1 corresponds to pure 
imaginary coefficients. Most plane-wave electronic-structure codes always use 
complex coefficients, and so the coefficients will in general not be real, even in 
the presence of inversion and time-reversal symmetry. For a non-degenerate 
state, the coefficients will be real times an arbitrary global phase, determined 
by the initialization of the solution procedure. We must divide out this global 
phase to make the coefficients real. In a degenerate subspace, the states need 
not be cigenstates of inversion, and so in general they may not just be real 
times a global phase. Instead, in each subspace of degeneracy n we take the 
2n vectors given by the real and imaginary parts of each wavef unction, and 
then use a Gram-Schmidt process to find n real orthonormal wavefunctions 
spanning the subspace. 

The density and exchange-correlation potential arc real already in the 
presence of inversion symmetry and there is no arbitrary phase possible. The 
real-space density is always real: p (r) = p* (r). With inversion symmetry, 
we also have p{r) — p (— r). In reciprocal space, 

P(r) = $^PGe^°- (61) 

G 

P*ir) = J2phe-'''-'' (62) 

G 

p{-r)^J2PG^~'''"' (63) 

G 

Together, these relations imply = Pc,, i-e. the reciprocal-space coefficients 
are real. Precisely the same equations apply for the exchange-correlation 
potential. 

The wavefunction, density, and exchange-correlation potential are then 
all stored as real coefficients, saving disk space (for the files), memory, and 
operations compared to the complex representation. Usually, only the lack of 
inversion symmetry of the lattice and basis would require the use of complex 
wavefunctions, but if spin-orbit coupling or magnetic fields are present, then 
time-reversal symmetry is lost and complex wavefunctions again are required. 

8. Computational Issues 

8.1. Memory estimation 

In the beginning of each run, all the major code components print the 
amount of memory available per CPU and an estimate of memory required 
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per CPU to perform the calculation. If the latter exceeds the former, the 
job is likely to fail with a memory allocation error. The amount of memory 
required is estimated by determining the sizes of the largest arrays after read- 
ing in the parameters of the system from the input files. A straightforward 
approach to estimating the amount of available memory is to allocate mem- 
ory by incremental amounts until the allocation call returns with an error. 
Unfortunately, in many implementations the allocation call returns without 
an error even if the requested amount of memory is not physically available, 
but the system fails when trying to access this "allocated" memory. We 
implement another approach based on the Linux /proc file system. First, 
each CPU opens file /proc/meminf o and reads in the values of MemFree and 
Cached. The sum of these two values gives the amount of memory avail- 
able per node. This approach works on almost all modern high-performance 
computing systems where the Linux /proc File System is accessible. (How- 
ever, for BSD-based MacOS which lacks /proc/meminf o, we read the page 
size and number of free and speculative pages from the command vm_stat.) 
Second, each CPU calls an intrinsic Fortran routine that returns the host 
name which is unique for each node. By comparing host names reported 
by different CPUs we identify the number of CPUs per node. The amount 
of memory available per CPU is then given by the ratio of the amount of 
memory available per node to the number of CPUs per node. 

8.2. Makefiles 

The main codes are in the Epsilon, Sigma, BSE, PlotXct, and MeanField 
directories. Routines used by all parts are in the Common directory, and rou- 
tines common to some of the MeanField codes are in the Symmetry directory. 
The Makefiles are designed for GNU Make, and enable targets in a directory 
to be built from any level of the directory hierarchy. They contain a full set 
of dependencies, including those between directories, to ensure that the build 
is correct after any changes to source, for ease in development and modifica- 
tion. This also enables use of parallel make on large numbers of processors 
for rapid builds - any omissions in the dependencies generally cause a failure 
for a parallel make. The special make target all-j {i.e. make -j all-j) 
begins by using all processes to build common directories, which contain files 
required by files in a large number of directories; otherwise, the build would 
fail due to attempts by multiple processes to read and write the same files in 
these directories. Commonly, Fortran Makefiles are set up with object files 
depending on other object files. However, the real situation is that object 
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files depend on module files ( . mod) for the modules they use, and only ex- 
ecutables depend on object files. Therefore we have dependencies directly 
on the module files to ensure the required files are present for compilation, 
particularly for parallel builds. 

8.3. Installation instructions 

The code can be installed via the following steps: 

cp [f lavor_real .mk/f lavor_cplx.nik] flavor, mk 
In -s conf ig/ [mysystem] .mk arch.mk 
make all 

make check [-jobscript] 

First a fiavor is selected by copying the appropriate file to flavor. mk. 
Then a configuration file must be put as arch . mk. Configurations appropriate 
for various supercomputers as well as for using standard Ubuntu or Macports 
packages are provided in the conf ig directory. Appropriate paths, libraries, 
and compiler fiags can be set in a new arch .mk for other systems. Finally the 
test suite should be run to confirm that the build is working. In serial, the 
command is make check; for parallel builds, it is make check-parallel. 
On machines with a scheduler, a job script should be created to run this 
command. For the architectures supported in conf ig, job scripts to run the 
test suite are provided in the testsuite directory, and can be used via maJse 
check- j obscript. 

8.4. Validation and verification 

The importance of verification and validation of complicated scientific 
software packages is receiving increasing attention. We use standard open- 
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source tools for code development, following accepted best practices [51 
Development is done with the subversion (SVN) version-control system 
and Trac, an issue-tracking system and interface to SVN [53|. All code runs 
identify the version and revision number used in the output for traceability 
of results, implemented via a special source file called svninfo.f90 which 
all SVN revisions must modify (enforced via a pre-commit hook). Debug 
mode can be enabled via -DDEBUG in the arch.mk file, which performs extra 
checking including of dynamic memory allocation and deallocation. A macro 
enables a check of the status returned by the system after an allocation 
attempt, and reports failures, identifying the array name, size, source file. 



52 



and line number, as well as which processor failed to allocate the array. 
Additionally, it keeps track of the amount of memory dynamically allocated 
and deallocated, so the code can report at the end of each run how much 
memory remains allocated, and the maximum and minimum memory 'high- 
water-mark' among the processors. In debug mode, a stack trace also can 
be enabled, either on just the root processor, or on all processors (causing 
the code to run much slower), which can be used to locate where problems 
such as segmentation faults are occurring (possibly on only one processor). 
Verbose mode can be enabled via -DVERBOSE which writes extra information 
as the calculation proceeds. 

The package contains a comprehensive test suite to test the various exe- 
cutables, run modes, and options, in the testsuite directory. Calculations 
of several different physical systems, with mean-field, epsilon, sigma, and 
BSE calculations, are carried out (including use of PlotXct and some util- 
ities), detecting any run-time errors and showing any warnings generated. 
Then selected results are extracted from the output and compared to refer- 
ence information within a specified tolerance. The actual calculated values, 
as well as timing for each step, are displayed. Each match is shown as either 
OK or FAIL, and a final summary is written of failures. The calculations are 
small and generally under converged, to make them quick enough for routine 
testing and rapid feedback. The mean-field steps are either EPM (quick se- 
rial calculations) or stored compressed output from DFT calculations. The 
Epsilon, Sigma, and BSE calculations are run either in serial or on 4 pro- 
cessors (for parallel builds). 

The test suite has numerous uses. It is useful for users to verify the success 
of a new build of the code on their platform (failures could be due to library 
problems, excessive optimizations, etc.). It is used for developers to ver- 
ify that the code is giving reproducible answers, ensure consistency between 
serial/parallel runs, as well as real/complex and spin-polarized/unpolarized 
runs, and check that the code works with new compilers or libraries. On a 
routine basis, the test suite is also useful for developers to check that changes 
to the code do not introduce problems. The driver scripts (run_testsuite . sh 
and run_regression_test.pl) and specifications for the files defining the 
test steps are originally based on, and developed in conjunction with, those 
of the Octopus code [l6|,|l7|. This framework is quite general and can be used 
easily for constructing a test suite for another code. It can be run in serial 
with the command make check (or make check-save to retain the work- 
ing directories from the runs), or in parallel with make check- jobscript 
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(or make check-jobscript-save). The system configuration file arch.mk 
can specify how to submit an appropriate jobscript for parallel execution on 
a supercomputer using a scheduler. Scripts are provided in the testsuite 
directory for some supercomputers. The test suite is used with a continuous- 



integration system, the open-source tool BuildBot |5J] , to ensure the integrity 
of the code during development. Each commit to the SVN repository trig- 
gers a build of the code on each of 10 "buildslaves," which have different 
configurations with respect to serial/parallel, compilers, and libraries. After 
the build, the test suite is run. BuildBot will report to the developers if the 
either the build or test runs failed, so the problem can be quickly remedied. 
Use of the various different buildslave configurations helps ensure that the 
code remains portable across different platforms and in accordance with the 
language standards. Two of the buildslaves are on a supercomputer with a 
scheduler, a situation for which standard BuildBot usage is problematic. We 
provide a Perl script buildbot_pbs.pl that can submit jobs, monitor their 
status, capture their output for BuildBot, and determine success or failure. 
This script is general for any PBS scheduler and can be used for other codes 
too. 

8.5. Supported operating systems, compilers, and libraries 

With the test suite, we have tested the code extensively with various 
configurations, and support the following compilers and libraries: 

• Operating systems: Linux, AIX, MacOS 

• Fortran compilers (required): pgf 90, ifort, gfortran, g95, openf90, 
sunf90, pathf90, crayftn, af90 (Absoft), nagfor, xlf90 (experi- 
mental) 

• C compilers (optional): pgcc, ice, gcc, opencc, pathcc, craycc 

• C++ compilers (optional): pgCC, ice, g++, openCC, pathCC, crayCC 

• MPI implementation (optional): OpenMPI, MPICHl, MPICH2, MVA- 
PICH2 

• LAPACK/BLAS implementation (required): NetLib, ATLAS, Intel 
MKL, ACML, Cray LibSci 

• ScaLAPACK/BLACS implementation (required by BSE if MPI is used): 
NetLib, Cray LibSci, Intel MKL, AMD 
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• FFTW (required): versions 2.1.5, 2.1.5.1, 2.1.5.2 

9. Utilities 

9.1. Visualization 

Several visualization tools designed to simplify work with the DFT codes 
and the BerkeleyGW package are provided in directory Visual. These include 
the surf ace. X code and Matter library. 

Surface is a C++ code for generating an isosurface of a volumetric scalar 
field (such as the wave function, charge density, or local potential). The 
scalar field is read from a Gaussian Cube or XCrySDen XSF file, the surface 
triangulation is performed using the marching- cubes or marching-tetrahedra 
algorithms, and the isosurface is written in the POV-Ray scripting language. 



The final image is rendered using the ray-tracing program POV-Ray [55 
Running the surface code requires a fairly complicated input parameter file. 
A brief description of the input file parameters is given in the header of 
surface . cpp. 

Matter is a Python library for manipulating atomic structures with peri- 
odic boundary conditions. It can translate and rotate the atomic coordinates, 
generate super-cells, assemble atomic systems from fragments, and convert 
between different file formats. The supported file formats are mat, paratec, 
vasp, espresso, siesta, tbpw, xyz, xsf , wien and povray. mat is the native 
file format of the library, paratec, vasp, espresso, siesta, wien and tbpw 
represent the formats used by different plane-wave and local-orbital DFT 
codes and an empirical pseudopotential code, xyz is a simple format sup- 
ported by many molecular viewers, and xsf is the internal format of XCryS- 
Den [56]. povray stands for a scripting language used by the ray-tracing 
program POV-Ray. The core of the library consists of files common. py, 
matrix. py, and matter. py. Script convert. py is a command-line driver 
that performs the basic operations supported by the library. Script link.py 
is a molecular assembler that can be used to rotate and link two molecules 
together. Script gsphere.py generates a real-space grid and a sphere of G- 
vectors given lattice parameters and a kinetic-energy cutoff. This helps to 
estimate the number of unoccupied states needed in GW calculations. Script 
average. py takes an average of the scalar field on the faces or in the vol- 
ume of the unit cell. This is used to determine the vacuum level in DFT 
calculations. Script volume. py converts the a3Dr file produced by PARATEC 
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or BerkeleyGW to Gaussian Cube or XCrySDen XSF format. Each script re- 
quires a different set of command-line arguments. Running individual scripts 
without arguments displays a list of all possible command line arguments and 
a short description of each argument. 

9.2. Band- structure interpolation 

To plot the quasiparticle band-structure of a system, we provide two 
methods. The first, sig2wan is a utility that uses Wannier interpolation of 



the band-structure [57|, |58[ and is based on the Waiinier90 [59[ package. This 
utihty does not construct the Wannier functions; the user has to do that us- 
ing the package used to construct the mean-field eigenfunctions, PARATEC or 
Quantum ESPRESSO. Once the Wannier functions have been constructed, this 
utility replaces the mean-field eigenvalues in the wannier . eig file (generated 
by Wannier 90) with the quasiparticle eigenvalues. The user then can rerun 
the Wannier90 executable to generate band-structures along arbitrary di- 
rections. Because one replaces the mean-field eigenvalues with quasiparticle 
eigenvalues, this approach does not work well if one replaces only some of 
the mean-field eigenvalues with quasiparticle ones for entangled bands. The 
second method for plotting quasiparticle band-structure, the inteqp utility, 
uses Eq. Hi] to construct the band-structure along arbitrary directions. In 
this method, we interpolate directly the quasiparticle corrections, which are 
significantly smoother functions of k and E than the quasiparticle eigenvalues 
themselves. Therefore, this method requires both the mean-field eigenvalues 
and eigenfunctions along the desired band-structure direction. 

9.3. Other 

We provide a general-purpose utility called mf _convert .x which can con- 
vert between binary and ASCII formats of wavefunction, density, and exchange- 
correlation potential files. The real/complex fiavor is determined by reading 
the file header, and if the utility is called through mf _convert_wrapper . sh, 
the binary/ASCII format is detected via the grep command and need not 
be specified. This converter is useful for moving such files between different 
platforms, since the binary files are more compact and the form read by the 
code, but are not necessarily portable between different platforms, whereas 
the ASCII files are. 

The image-charge model (ICM) is implemented in a utility called icm.x, 
based on the Surface code. For a molecule weakly coupled to a metallic 
surface, the self-energy correction to a state can be well approximated by the 
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sum of the self-energy correction of that state in the isolated molecule and an 



additional term due to screening from the metal [60||. This screening term is 
modeled as the electrostatic energy of the charge density of the wavefunction 
and its induced image-charge distribution in the metal. Let the operator R 
be a reflection across an image plane. Then 
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where the plus sign applies for occupied orbitals and the minus sign for 
unoccupied orbitals. This approximation is useful for modeling scanning- 
tunneling spectroscopy of molecules absorbed on metal surfaces 6l| and for 



quantum-transport calculations of molecular junctions |62 
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11. Appendix 

11.1. Specification of file formats 

Wavefunction files are needed by all parts of the code, with various 
filenames. The epsilon executable uses an unshifted grid (WFN) and a 
shifted grid (WFNq). The sigma executable uses WFN_inner to construct the 
self-energy operator and evaluates matrix elements with WFN_outer. The 
kernel executable constructs kernel matrix elements with a coarse unshifted 
(WFN_co) and shifted grid (WFNq_co). The absorption and plotxct exe- 
cutables use a fine unshifted grid (WFN_f i) and, with the velocity operator, 
a fine shifted grid (WFNq_f i). Additionally, the sigma executable needs the 
charge- density RHO for GPP calculations, and needs the exchange-correlation 
potential VXC (unless its pre-computed matrix elements are supplied in a 
vxc.dat file). These files all share a common format, which begins with a 
header. Parts in italics are only for wavefunction files, not charge-density 
or exchange-correlation potential files. Each bullet represents a record in 
the file. The utility wfn_rho_vxc_inf o.x can read the information from 
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the header and report it in a comprehensible format to the user. A mod- 
ule of driver read/write routines for the formats specified here is available 
in the library directory, for mean-field codes to use in writing output for 
BerkeleyGW. This library is used by Octopus and PARATEC. 

• [WFN/RHO/VXC] - [Real/Complex] date time 

• number of spins, number of G-vectors, number of symmetries, 

[0 for cubic symmetry/1 for hexagonal symmetry] , number of atoms, 
charge-density cutoff (Ry) , number of k-points , number of bands, 
maximum number of G-vectors for any k-point, wavefunction cutoff 

(Ry) 

• FFT grid(l:3) , k-grid(l:3), k-shift(l:3) 

• real-space cell volume (a.u.), lattice constant (a.u.), lattice 
vectors(l:3, 1:3) in units of lattice constant, real-space metric 
tensor(l:3, 1:3) (a.u.) 

• reciprocal-space cell volume (a.u.), reciprocal lattice constant 
(a.u.), reciprocal lattice vectors(l:3, 1:3) in units of reciprocal 
lattice constant, reciprocal-space metric tensor(l:3, 1:3) (a.u.) 

• symmetry rotation matrices(l:3, 1:3, 1: number of symmetries) 
in reciprocal-lattice basis 

• symmetry fractional translations (1 : 3 , 1: number of symmetries) 
in units of lattice vectors times 27r (see also Sec. 17 . ID 

• atomic positions(l : 3, 1: number of atoms) in units of lattice 
constant, atomic numbers (1 : number of atoms) 

• number of G-vectors for each k-point(l : number of k-points) 

• k-point weights (1 : number of k-points) from to 1 

• k-point coordinates (1 : 3, 1: number of k-points) in crystal 
coordinates 

• index of lowest band to use on each k-point (1 .-number of k-points) 
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index of highest occupied band on each k-point(l : number of 
k-points) 



• energy eigenvalues (1 : number of bands, l:number of k-points, 
1: number of spins) (Ry) 

• occupations (1 -.number of bands, l:number of k-points, 1 .-number 
of spins) from to 1 

In the body of a file, G- vectors are listed as (1:3, l:ng), expressed as 
integers in reciprocal lattice units, and data is listed as (l:ng, l:number of 
spins). G- vector components should be chosen in the interval [— n/2,n/2) 
where n is the FFT grid. A full sphere must be used, not a half sphere as in 
the Hermitian FFT representation for a real function. Each set is preceded 
by an integer specifying how many records the G-vectors or data is broken 
up into, for ease of writing files from a code parallelized over G-vectors. 
Wavef unction files follow the header with a listing of all the G-vectors; for 
each k-point, there is first a list of G-vectors, and then the wavefunction 
coefficients for each band. RHO and VXC files have instead just one listing 
of G-vectors and coefficients after the header. The wavefunction coefficients 
must be normalized so that the sum of their squares is 1. The RHO coefficients 
are normalized such that their G = component is the number of electrons 
in the unit cell. The VXC coefficients are in Ry. 

The recommended scheme is to use pre-computed exchange matrix el- 
ements in an ASCII file vxc.dat, because VXC is only applicable to a lo- 
cal exchange-correlation functional. Hartree-Fock, hybrid-density function- 
als and self-consistent static GW (COHSEX) calculations do not fall in this 
category. For Hartree-Fock and some hybrid functionals (PBEO, B3LYP), 
one can still use VXC if one sets bare_exchange_f raction in sigma.inp to 
compensate for a fraction of the bare exchange which is not included in VXC. 
Matrix elements are in eV and are always written with real and imaginary 
parts (even in the real version of the code) . The vxc . dat file may contain any 
number of k-points in any order. It contains a certain number of diagonal 
elements (ndiag) and a certain number of offdiagonal elements (noffdiag). 
The x . dat file shares the same format and can supply saved matrix elements 
of bare exchange instead of calculating them. Each k-point block begins with 
the line: 

kx, ky, kz [crystal coordinates] , ndiag*nspin, nof f diag*nspin 
There are then ndiag*nspin lines of the form 
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ispin, idiag, Re (idiag|V|idiag) , Im (idiag|V|idiag) 
There arc then nof f diag*nspin Unes of the form 

ispin, ioffl, ioff2, Re (iof f l|V|iof f 2) , Im (iof f l|V|iof f 2) 
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